Monday, September 22, 2025

Target Encoding & One-Hot Encoding

I had completely overlooked these concepts earlier on; they didn't come up until just now - this late in the game - as I was reading some of the advanced sections of the Jansen Algorithmic Trading book.

When you run a model, there are always going to be features you do NOT want to include, for many reasons (circular correlation, redundancy, etc.). Generally, you would not want to include symbol, sector, and industry in your model, because the symbol, after all, is nothing more than an index in the data - and the symbol in and of itself should have NOTHING to do with predicting returns.

So we keep an Exclusion array of metrics we don't want included:
EXCLUDED_FEATURES = [
    # 'symbol', 
    'date', 'year', 'quarter', 'close', 'fwdreturn',  # identity + target
    # 'sector', 'industry',  optional: include via one-hot or embeddings if desired
...
]

Symbol, Sector and Industry were initially excluded because they are text strings, and models don't like text data (some won't run on anything but numerics, some give warnings, and others produce unpredictable results if they tolerate it at all).

Then I read about target encodings for symbols - which allow you to KEEP symbols in the data, and something called "one-hot" encodings for things like sectors and industries.

What these encodings do is take categorical data - like sectors and industries - and convert it to numerics so that the model will consider it. This can capture the association between category and target, sometimes boosting model power.

So - adding the code to do this wasn't terribly painful. The symbol was encoded as a single target-encoded feature (symbol_enc) - and removed from the above-mentioned EXCLUDED_FEATURES list. The code to one-hot encode the Sector and Industry was snapped in, and off we went...
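For anyone curious what this looks like in code, here's a minimal pandas sketch of the two techniques, using a tiny hypothetical frame (the column names match my data; the values are made up). Note the target encoding here is a plain per-symbol mean with no smoothing or out-of-fold protection - in a real pipeline you'd compute it on the train split only, to avoid leakage.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are hypothetical)
df = pd.DataFrame({
    'symbol': ['AAPL', 'AAPL', 'MSFT', 'MSFT', 'XOM'],
    'sector': ['Technology', 'Technology', 'Technology', 'Technology', 'Energy'],
    'fwdreturn': [0.10, 0.14, 0.08, 0.06, -0.02],
})

# Target encoding: replace each symbol with that symbol's mean forward
# return (in the real pipeline, compute this on the TRAIN split only).
symbol_means = df.groupby('symbol')['fwdreturn'].mean()
df['symbol_enc'] = df['symbol'].map(symbol_means)

# One-hot encoding: one 0/1 column per sector category.
df = pd.get_dummies(df, columns=['sector'], prefix='sector')
print(df.columns.tolist())
```

The one-hot step is also exactly why industries blow up the feature count: get_dummies emits one new column per distinct category value.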

The R-squared dropped from .11 to .06. Yikes...
🪓  Pruning all splits - train, val, test - to the pruned set of features (final_features).
Training pruned model...
FULL R² -- Train: 0.4989, Validation: 0.0776, Test: 0.0673
PRUNED R² -- Train: 0.2216, Validation: 0.0348, Test: 0.0496
Selected FULL model based on test R².
Previous best test R²: 0.1126
Current model test R²: 0.0673
Current model did NOT exceed best. Loading previous best model and features.
Final Model Test Metrics -- R²: 0.1126, RMSE: 0.2133, MAE: 0.1471


I quickly realized that by one-hot encoding the Industry, we got so many new feature columns - one per industry - that it undoubtedly screwed up the model.

I decided to run again, and only one-hot encode the Sector. So now with this run, we have a target encoding for symbol, and a one-hot encoding for sector. There are not that many sectors, so this doesn't become too unwieldy.

But - here is what we got on our SHAP analysis after this change:
🧮 Feature Selection Overview:
✅  consumer_sentiment_×_earnings_yield                SHAP=0.04511
✅  dilutedEPS                                         SHAP=0.02566
✅  rev_grw_4q                                         SHAP=0.01191
✅  evToSales                                          SHAP=0.01181
✅  roe                                                SHAP=0.01071
✅  symbol_enc                                         SHAP=0.01055
✅  business_confidence_×_capexToRevenue               SHAP=0.00925
✅  treasury_spread_x_debt_to_equity                   SHAP=0.00867
✅  cpippi_marginsqueeze                               SHAP=0.00719
✅  earningsYield                                      SHAP=0.00703
✅  inflation_x_debt_to_equity                         SHAP=0.00701
⛔  ppi_x_capextorevenue                               SHAP=0.00628
⛔  operatingCashFlow_to_totalAssets                   SHAP=0.00610
⛔  momentum_1y                                        SHAP=0.00586
⛔  gross_margin                                       SHAP=0.00575
⛔  realized_vol_3m                                    SHAP=0.00543
⛔  long_term_debt_to_equity                           SHAP=0.00526
⛔  rev_grw_qoq                                        SHAP=0.00500
⛔  vix_×_debt_to_equity                               SHAP=0.00480
⛔  goodwill_to_totalAssets                            SHAP=0.00432
⛔  cpi_x_netprofitmargin                              SHAP=0.00371
⛔  incomeQuality                                      SHAP=0.00360
⛔  rev_grw_qoq_to_totalAssets                         SHAP=0.00341
⛔  netDebtToEBITDA                                    SHAP=0.00340
⛔  evToFreeCashFlow                                   SHAP=0.00304
⛔  debt_to_equity                                     SHAP=0.00303
⛔  salesGeneralAndAdministrativeToRevenue             SHAP=0.00288
⛔  vix_x_evtoebitda                                   SHAP=0.00273
⛔  rev_grw_pop_sector_z                               SHAP=0.00268
⛔  eps_grw_qoq                                        SHAP=0.00258
⛔  evToEBITDA                                         SHAP=0.00258
⛔  interestBurden                                     SHAP=0.00255
⛔  momentum_1y_sector_z                               SHAP=0.00252
⛔  evToEBITDA_sector_z                                SHAP=0.00237
⛔  asset_turnover                                     SHAP=0.00235
⛔  totalEquity                                        SHAP=0.00231
⛔  net_margin                                         SHAP=0.00228
⛔  workingCapital                                     SHAP=0.00218
⛔  cc_delinquency_rate_×_debt_ratio                   SHAP=0.00209
⛔  inventory_turnover                                 SHAP=0.00209
⛔  capexToRevenue                                     SHAP=0.00197
⛔  freeCashFlowPerShare                               SHAP=0.00191
⛔  daysOfInventoryOutstanding_sector_z                SHAP=0.00186
⛔  fcf_ps_grw_qoq                                     SHAP=0.00186
⛔  eps_grw_qoq_to_totalAssets                         SHAP=0.00182
⛔  daysOfInventoryOutstanding                         SHAP=0.00180
⛔  daysOfPayablesOutstanding                          SHAP=0.00179
⛔  capexToRevenue_sector_z                            SHAP=0.00179
⛔  operatingCashFlow                                  SHAP=0.00173
⛔  totalEquity_to_totalAssets                         SHAP=0.00163
⛔  stockBasedCompensationToRevenue                    SHAP=0.00159
⛔  cash_ratio                                         SHAP=0.00158
⛔  evToOperatingCashFlow                              SHAP=0.00156
⛔  debt_ratio                                         SHAP=0.00156
⛔  eps_grw_4q                                         SHAP=0.00155
⛔  receivables_turnover                               SHAP=0.00153
⛔  ocf_to_current_liabilities                         SHAP=0.00151
⛔  totalDebt                                          SHAP=0.00146
⛔  operating_margin                                   SHAP=0.00143
⛔  debt_ratio_sector_z                                SHAP=0.00143
⛔  workingCapital_to_totalAssets                      SHAP=0.00135
⛔  operatingReturnOnAssets                            SHAP=0.00130
⛔  daysOfSalesOutstanding                             SHAP=0.00129
⛔  ordinary_shares                                    SHAP=0.00128
⛔  roa                                                SHAP=0.00127
⛔  earnings_surprise                                  SHAP=0.00122
⛔  bookValue                                          SHAP=0.00118
⛔  totalLiabilities                                   SHAP=0.00114
⛔  fcf_ps_grw_qoq_to_totalAssets                      SHAP=0.00113
⛔  goodwill                                           SHAP=0.00112
⛔  fcf_ps_grw_4q                                      SHAP=0.00106
⛔  eps_grw_pop_sector_z                               SHAP=0.00100
⛔  freeCashFlow                                       SHAP=0.00090
⛔  current_ratio                                      SHAP=0.00090
⛔  ebitda_margin                                      SHAP=0.00087
⛔  totalRevenue                                       SHAP=0.00082
⛔  quick_ratio_sector_z                               SHAP=0.00078
⛔  free_cash_flow                                     SHAP=0.00073
⛔  ocf_to_total_liabilities                           SHAP=0.00062
⛔  sector_Energy                                      SHAP=0.00062
⛔  returnOnTangibleAssets                             SHAP=0.00059
⛔  avgDilutedShares                                   SHAP=0.00050
⛔  sector_Basic Materials                             SHAP=0.00034
⛔  sector_Technology                                  SHAP=0.00025
⛔  sector_Real Estate                                 SHAP=0.00025
⛔  sector_Financial Services                          SHAP=0.00019
⛔  sector_Industrials                                 SHAP=0.00011
⛔  sector_Consumer Cyclical                           SHAP=0.00010
⛔  sector_Communication Services                      SHAP=0.00009
⛔  sector_Healthcare                                  SHAP=0.00007
⛔  sector_Utilities                                   SHAP=0.00004
⛔  sector_Consumer Defensive                          SHAP=0.00002
⛔  fcf_ps_grw_pop                                     SHAP=0.00000
⛔  rev_grw_pop                                        SHAP=0.00000
⛔  eps_grw_pop                                        SHAP=0.00000
⛔  unemp_x_rev_grw                                    SHAP=0.00000
⛔  ocf_to_net_income                                  SHAP=0.00000
⛔  quick_ratio                                        SHAP=0.00000
⛔  current_ratio_sector_z                             SHAP=0.00000

The symbol target encoding? SHAP likes it. It moves the proverbial needle.

But when it comes to the sector and industry one-hot encodings, neither seems to be adding any real benefit.

Our R-squared was better than the one we had with one-hot encoded industries - but less than our previous and best runs with NO encodings as shown below.

Symbol Target-Encoded and Sector One-Hot Encoded:
🪓  Pruning all splits - train, val, test - to the pruned set of features (final_features).
Training pruned model...
FULL R² -- Train: 0.4708, Validation: 0.0757, Test: 0.0868
PRUNED R² -- Train: 0.3847, Validation: 0.0779, Test: 0.0486
Selected FULL model based on test R².
Previous best test R²: 0.1126
Current model test R²: 0.0868
Current model did NOT exceed best. Loading previous best model and features.
Final Model Test Metrics -- R²: 0.1126, RMSE: 0.2154, MAE: 0.1485
Feature importance summary:
  → Total features evaluated: 78

So - one last run, with only the Symbol Target Encoding (not Sector and Industry)
🪓  Pruning all splits - train, val, test - to the pruned set of features (final_features).
Training pruned model...
FULL R² -- Train: 0.5169, Validation: 0.0788, Test: 0.0463
PRUNED R² -- Train: 0.2876, Validation: 0.0456, Test: 0.0083
Selected FULL model based on test R².
Previous best test R²: 0.1126
Current model test R²: 0.0463
Current model did NOT exceed best. Loading previous best model and features.
Final Model Test Metrics -- R²: 0.1126, RMSE: 0.2154, MAE: 0.1479
Feature importance summary:
  → Total features evaluated: 78

And...back to no symbol, sector, industry
🪓  Pruning all splits - train, val, test - to the pruned set of features (final_features).
Training pruned model...
FULL R² -- Train: 0.4876, Validation: 0.1095, Test: 0.0892
PRUNED R² -- Train: 0.2870, Validation: 0.0713, Test: 0.0547
Selected FULL model based on test R².
Previous best test R²: 0.1126
Current model test R²: 0.0892
Current model did NOT exceed best. Loading previous best model and features.
Final Model Test Metrics -- R²: 0.1126, RMSE: 0.2188, MAE: 0.1486

Once again - no encoding at all on symbol, sector and industry...
🪓  Pruning all splits - train, val, test - to the pruned set of features (final_features).
Training pruned model...
FULL R² -- Train: 0.4985, Validation: 0.1051, Test: 0.0969
PRUNED R² -- Train: 0.2742, Validation: 0.0981, Test: 0.0869
Selected FULL model based on test R².
Previous best test R²: 0.1126
Current model test R²: 0.0969
Current model did NOT exceed best. Loading previous best model and features.
Final Model Test Metrics -- R²: 0.1126, RMSE: 0.2160, MAE: 0.1487
Feature importance summary:
  → Total features evaluated: 78

Conclusion: These encodings did not improve the R-squared - in fact, they moved it backwards. Unless I can find a configuration where the encoding adds real benefit, I will leave those features out, as the model was running originally.

I Have More Data Now - Enough for an AI RNN LSTM Model?

I have a LOT more data now than I did before. And an advanced architecture to process it.

Should I consider an RNN?

I knew I couldn't really pull it off with the Annual data I had - because by the time you split the data for training, validation, and testing there isn't enough to feed the algorithm.  

But - Quarterly! I have a LOT of quarterly data now, many statements per symbol across quarter-dates. ~70K rows of data!!!

So let's try doing an LSTM....I wrote a standalone LSTM, using Keras. Just a few lines of code. 

One important note about this! 

Do NOT mix your data processing, and or your XGBoost code, with neural network code!!! ALWAYS create a brand new virtual environment for your neural RNN code, because if you choose Keras or the other competing frameworks, they will require specific versions of Python libraries that may conflict with your data processing and/or XGBoost libraries!

Now, with that important disclaimer, a small sample of the code. We will highlight in blue since my blog tool apparently has no code block format.

# -------------------------
# Imports (the snippet needs these to run)
# -------------------------
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping

# X has shape (samples, SEQ_LEN, n_features); y is the forward return.
# SEQ_LEN, TEST_SIZE, RANDOM_STATE, feature_cols are defined elsewhere.

# -------------------------
# Train/test split
# -------------------------
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=TEST_SIZE, random_state=RANDOM_STATE
)

# -------------------------
# Build LSTM model
# -------------------------
model = Sequential()
model.add(LSTM(32, input_shape=(SEQ_LEN, len(feature_cols)), return_sequences=False))
model.add(Dropout(0.2))
model.add(Dense(16, activation='relu'))
model.add(Dense(1))  # regression output

model.compile(optimizer='adam', loss='mse')

# Early stopping
es = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

# -------------------------
# Train
# -------------------------
history = model.fit(
    X_train, y_train,
    validation_split=0.1,
    epochs=50,
    batch_size=16,
    callbacks=[es],
    verbose=1
)

# -------------------------
# Evaluate
# -------------------------
y_pred = model.predict(X_test).flatten()
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
print(f"R²: {r2:.3f}, MAE: {mae:.3f}")
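One thing the snippet above glosses over: X has to already be shaped (samples, SEQ_LEN, n_features). Building those sequences out of per-symbol quarterly rows is a step of its own; here's a sketch (the function and toy data are hypothetical, not my actual pipeline code) that windows each symbol's chronologically sorted quarters:

```python
import numpy as np
import pandas as pd

SEQ_LEN = 4  # quarters per input sequence (assumed)

def build_sequences(df, feature_cols, target_col='fwdreturn', seq_len=SEQ_LEN):
    """Window each symbol's chronologically sorted rows into
    (seq_len, n_features) blocks; the target is the window's last row."""
    xs, ys = [], []
    for _, g in df.sort_values('date').groupby('symbol'):
        feats = g[feature_cols].to_numpy(dtype='float32')
        targets = g[target_col].to_numpy(dtype='float32')
        for end in range(seq_len, len(g) + 1):
            xs.append(feats[end - seq_len:end])
            ys.append(targets[end - 1])
    return np.stack(xs), np.array(ys)

# Toy data: two symbols, six quarters each (dates as simple ordinals)
df = pd.DataFrame({
    'symbol': ['A'] * 6 + ['B'] * 6,
    'date': list(range(6)) * 2,
    'roe': np.arange(12, dtype='float32'),
    'fwdreturn': np.linspace(-0.1, 0.1, 12).astype('float32'),
})
X, y = build_sequences(df, ['roe'])
print(X.shape, y.shape)  # (6, 4, 1) (6,)
```

Each symbol with six quarters yields three overlapping 4-quarter windows, so the toy frame produces six sequences in total.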

Well, how did it go?


The previous XGBoost R-squared value was consistently .11-.12. Now, we are getting .17-.19. This is a significant improvement!
 

Changing the Ensemble Model to a Stacked Meta Ensemble

 
Earlier we had a weighted ensemble model that essentially took the R-squared values of Annual and Quarterly and used them as weighting factors to combine the two.

It was here that I realized we were not calculating or saving the predicted forward return - we were only calculating scores, writing them to a scoring summary, and saving the R-squared.

So I changed things around. I added a stacked meta ensemble, and will describe how these work below. We now run BOTH of these.

Weighted Ensemble

  • A simple blend of the two base models.
  •  Annual and quarterly predictions are combined with weights proportional to their out-of-sample R² performance.

Result: ensemble_pred_fwdreturn and ensemble_pred_fwdreturn_pct.

This improves stability but is still fairly “rigid.”
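The blend itself is only a few lines. A sketch, with hypothetical R² values and prediction arrays (a negative R² is clipped to zero so a failing base model gets no weight):

```python
import numpy as np

# Hypothetical out-of-sample R² for each base model
r2_annual, r2_quarterly = 0.11, 0.17

# Weights proportional to R², clipped at zero and normalized to sum to 1
w_a = max(r2_annual, 0.0)
w_q = max(r2_quarterly, 0.0)
w_a, w_q = w_a / (w_a + w_q), w_q / (w_a + w_q)

# Toy base-model predictions (hypothetical values)
annual_pred = np.array([0.05, -0.02, 0.10])
quarterly_pred = np.array([0.07, 0.01, 0.04])

ensemble_pred_fwdreturn = w_a * annual_pred + w_q * quarterly_pred
print(ensemble_pred_fwdreturn)
```

The "rigid" part is plain to see: the weights are fixed scalars, the same for every row regardless of sector, regime, or anything else.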


Meta-Model Ensemble (Stacked Ensemble)

A second-level model (XGBoost) is trained on:

  1. Predictions from the annual model
  2. Predictions from the quarterly model
  3. Additional features (sector, industry, etc.)

This meta-model learns the optimal way to combine signals dynamically rather than relying on fixed weights.

Result: ensemble_pred_fwdreturn_meta and ensemble_pred_fwdreturn_meta_pct.
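A sketch of the stacking idea. The real meta-model is XGBoost; I'm using sklearn's GradientBoostingRegressor here as a stand-in so the snippet runs without xgboost installed, and the inputs are synthetic - in the real pipeline the base-model predictions fed to the meta-model should be out-of-fold to avoid leakage:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n = 400

# Synthetic stand-ins for the base models' predictions; in the real
# pipeline these should be OUT-OF-FOLD predictions to avoid leakage.
true_ret = rng.normal(0, 0.2, n)
annual_pred = true_ret + rng.normal(0, 0.15, n)
quarterly_pred = true_ret + rng.normal(0, 0.10, n)
extra_feat = rng.normal(0, 1, n)  # stands in for sector/industry features

X_meta = np.column_stack([annual_pred, quarterly_pred, extra_feat])

# Second-level model learns how to combine the base signals dynamically.
meta = GradientBoostingRegressor(random_state=0)
meta.fit(X_meta, true_ret)
ensemble_pred_fwdreturn_meta = meta.predict(X_meta)
print(round(r2_score(true_ret, ensemble_pred_fwdreturn_meta), 3))
```

Unlike the fixed-weight blend, the meta-model can learn to lean on whichever base signal is more reliable in different regions of the feature space.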

How well did it work?
Results

  1. Weighted Ensemble: R² ~0.19, Spearman ~0.50
  2. Meta-Model Ensemble: R² ~0.75, Spearman ~0.65

Quintile backtests confirm a strong monotonic relationship between predicted quintiles and realized forward returns.
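A quintile backtest is easy to sketch with pandas: bucket the predictions into five groups and compare the realized mean forward return per bucket (synthetic data here, not my actual results):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 1000
pred = rng.normal(0, 1, n)
# Realized returns loosely correlated with predictions (synthetic)
realized = 0.05 * pred + rng.normal(0, 0.1, n)

df = pd.DataFrame({'pred': pred, 'fwdreturn': realized})
# Bucket predictions into quintiles (Q1 = lowest predicted return)
df['quintile'] = pd.qcut(df['pred'], 5, labels=['Q1', 'Q2', 'Q3', 'Q4', 'Q5'])
by_q = df.groupby('quintile', observed=True)['fwdreturn'].mean()
print(by_q)
```

If the model has real signal, the mean realized return should step up roughly monotonically from Q1 to Q5 - that's the pattern the backtests confirmed.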

Friday, September 5, 2025

I Need More Financial Quant Data - Techniques On How To Get It

I may have posted earlier about how finding enough data - for free - is extreeeemely difficult.

Even if you can find it, ensuring its integrity can cost time, money, and cycles - so much so that it's simpler to just let someone else deal with it and subscribe. Problem is, I am "subscriptioned out". I can't keep adding layer upon layer of subscriptions, because that money adds up.

So - I work hard to see what data is available out there (e.g. Kaggle). It makes no sense to waste processing cycles and bandwidth if someone has already cultivated that data and is willing to share it.

I also have learned that there are a lot of bots out there that screen-scrape, using tools like Beautiful Soup. And if you are clever enough to use layers of (secure - I can't stress that enough) proxies, and morph your digital fingerprint (i.e. changing up browser headers and such), you can go out there and find data, and save it - and even check the integrity of the data by checking it against a couple or three sources. 

And don't forget rate-limiting and Cloudflare tools - you have to figure out how to evade those as well. It's a chess game, and one that seemingly never ends.

Anyway - I decided I needed quarterly data. My XGBoost model just wasn't performing the way I wanted. I added more interactive features from macro data, and even a "graph score" (see earlier posts). And indeed the score - the R-squared - came up, but it didn't get to where I wanted it, and the list of stock picks was not made up of stocks I would personally invest in.

I decided to do two things:

  1. Find superior data source(s) - preferably where I could get more and better quarterly data - for free.
  2. Consolidate the code so that I didn't have to manage and sync code that was fetching on one frequency (annual) vs another.

I underestimated these tasks. Greatly.

I found a Github project that could hit different data sources. It had an OO design - and was probably over-engineered IMHO. But, I got what the author was after - by using a base class and then plugging in different "interface classes", you could maybe switch back and forth between different data sources. 

So I tried it. And, lo and behold, it didn't work. At first it did - for annual statements. But after I downloaded about 8k quarterly statements, I was horrified to realize that all of the quarterly statements were clones of the annual statements. Wow, what a waste!!!

I checked - and the quarterly data was there indeed. The Github code was flawed. So, I fixed it. And even enhanced it. 

This is the first time I have actually contributed to a community Github project. I am familiar with Git and Github, but if you are not doing this kind of thing on the regular, you have to re-learn topics such as branch development, Pull Requests, Merges, etc. And perhaps one of the most annoying things, is that the upstream owner of the repository may not like or agree with your changes.  

In this particular case, the repo owner was using property decorators. Those work fine until your methods need parameters - a @property is referenced like an attribute, so calls that pass parameters simply don't work. I had to blow those out. He didn't seem happy about it, but eventually he seemed to acknowledge the need. Another difference of opinion had to do with his use of an lru_cache decorator on calls. I wasn't up to speed on this, so I read up on it and concluded that this was NOT the right situation for caching, let alone LRU caching. It can speed things up TREMENDOUSLY in the right use cases - but if you are batch-downloading thousands of statements for thousands of symbols, you are never going to consult the cache for the same symbol twice, so a cache like that actually creates overhead - and risk (i.e. running out of resources like memory if you don't set a max size on the cache).
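The overhead is easy to demonstrate: with an unbounded lru_cache, a batch job over unique symbols gets zero hits while the cache grows one entry per call (the function below is a stand-in for the real downloader):

```python
from functools import lru_cache

@lru_cache(maxsize=None)  # unbounded: every result stays in memory forever
def fetch_statement(symbol: str, period: str) -> str:
    # Stand-in for a network download of one financial statement
    return f"{symbol}:{period}"

# Batch-downloading thousands of UNIQUE symbols means every call is a
# cache miss -- the cache adds bookkeeping and memory but never pays off.
for i in range(5000):
    fetch_statement(f"SYM{i}", "quarterly")

info = fetch_statement.cache_info()
print(info)  # hits=0, misses=5000, currsize=5000
```

Zero hits, 5,000 retained entries: pure cost. A cache only earns its keep when the same arguments recur.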

In the end, I have some code that works. I had to do a rebase and update the pull request, and if he doesn't take these changes the way I wrote them and need them, I guess I can always just create my own repo and go solo on this. I would rather not, because the repo owner does sync his repository with the pip installer, which makes it easy to download and update.

  

Thursday, August 28, 2025

Latest Changes to XGBoost Quant Financial Model

 

Latest Changes:

Added new macro features to my model - CPI, PPI, VIX (the last change I made was adding a beats/meets/misses surprise score a few weeks ago)

I added some interactive features based on these (5 in total). I have learned that these interactives move the model predictability like nothing else - which is why I added more. 

    R-squared score came back up to approaching .3 now with these.

    My correlation is still inverted from 1 yr fwd return - even more so. So I flip the score.

One major change was that I added a "graph_score" which was a nightmare to produce. LLMs can NOT seem to handle this task AT ALL. So I finally had to do it mostly myself, and got something working fairly well - it recognizes good graphs vs bad graphs and downscores bad graph patterns.

I forked the stockdex Github project, and am making some changes that I will submit back with a pull request. The macrotrends data source could not handle quarterly data. Once I realized it could be done, I decided to enhance the code to do this so I could run the model and get more quarterly statements alongside the pack of annual statements.

Once I get the model run with annual+quarterly, I will probably retire this project and move onto more recent and LLM-based stuff.

Friday, August 1, 2025

AI / ML - Data Source Comparison with More Data

"To validate generalizability, I ran the same model architecture against two datasets: a limited Yahoo-based dataset and a deeper Stockdex/Macrotrends dataset. Not only did the model trained on Stockdex data achieve a higher R² score (benefiting from more years of data), but the feature importances and pillar scores remained largely stable between the two. This consistency reinforces the robustness of the feature selection process and confirms that the model is learning persistent financial patterns, not artifacts of a specific data source or time window."

Details:

I wanted to move to an LSTM model to predict stock returns, but I was fortunate and patient enough to really read up and plan this before just diving into the deep end of the pool.

I learned that the "true AI" models - Transformers, RNNs (of which LSTM is a subclass), et al - require more data. I didn't have anywhere near enough data using Yahoo, which gives 4-5 years (at best) of data. And because I was calculating momentum, yoy growth and such, I would always lose one of the years (rows) right off the bat - a significant percentage of already-scarce data.

So, in digging around, I found a Python library called stockdex. It is architected to be able to use multiple data sets, but the default is macrotrends. 

But using this library and source left several challenges:

  1. No quarterly data in the Python API - although the website does have a "Format" drop down for Quarterly and Annual.
  2. The data was pivoted relative to Yahoo's. Yahoo puts the line items in columns (x) and the time periods in rows (y); the stockdex API downloaded it the opposite way. 
  3. The stockdex had no "names" for the items. 

Ultimately, I decided to use this because it returned a LOT more years of data.

  1. First, I put some code together to download the raw data (statements), and then "pivot" the data to match Yahoo's format. 
  2. Then, I used a mapping approach to change the columns from Macrotrends to Yahoo - so that I didn't have to change my logic that parsed statements.
  3. I did have to do a run-tweak on the Metrics and Ratios, and fix certain columns that were not coming in correctly.
  4. Lastly, I ran the model - same one as Yahoo and was able to keep the model logic essentially unchanged. 
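Steps 1 and 2 boil down to a transpose plus a column-name mapping. A sketch with a toy frame - the two mapping entries are illustrative, not the real (much longer) Macrotrends-to-Yahoo map:

```python
import pandas as pd

# stockdex/macrotrends orientation (as downloaded): line items on the
# rows, time periods on the columns -- the opposite of Yahoo's layout.
raw = pd.DataFrame(
    {'2023-Q4': [100.0, 40.0], '2024-Q1': [110.0, 44.0]},
    index=['revenue', 'cost-of-goods-sold'],
)

# Step 1: pivot (transpose) so periods become rows and items columns.
pivoted = raw.T

# Step 2: map Macrotrends item names onto the Yahoo names the existing
# statement-parsing logic expects (this two-entry map is illustrative).
MACRO_TO_YAHOO = {'revenue': 'totalRevenue', 'cost-of-goods-sold': 'costOfRevenue'}
statements = pivoted.rename(columns=MACRO_TO_YAHOO)
print(statements)
```

With the orientation and names normalized, the downstream parsing code never needs to know which source the statement came from.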

The training took a LOT longer on Stockdex. The combined train+val had 14,436 rows on it.
Here is what we got:
FULL R² -- Train: 0.8277, Validation: 0.3441, Test: 0.3537
PRUNED R² -- Train: 0.7714, Validation: 0.3146, Test: 0.3429
Selected FULL model based on test R².
Final Model Test Metrics -- R²: 0.3537, RMSE: 0.3315, MAE: 0.2282
Feature importance summary:
  → Total features evaluated: 79
  → Non-zero importance features: 75

Running and scoring the model took a very, very long time. Finally it came out with this Top 25 list.
Top 25 Stocks by Final adj_pillar_score:
     symbol  adj_pillar_score  improvement_bonus  pillar_score
1209    RUN          1.156474               0.05      1.106474
884     LZM          1.020226               0.05      0.970226
97     ARBK          1.018518               0.00      1.018518
277     CGC          1.009068               0.02      0.989068
262     CCM          0.982228               0.02      0.962228
821    KUKE          0.964131               0.00      0.964131
1415   TRIB          0.963591               0.02      0.943591
1473   UXIN          0.961206               0.05      0.911206
571    FWRD          0.957156               0.05      0.907156
859    LOCL          0.935929               0.00      0.935929
1069   OTLY          0.896289               0.00      0.896289
894     MBI          0.895565               0.05      0.845565
1159   QDEL          0.890248               0.05      0.840248
1039    ODV          0.861127               0.00      0.861127
1522    WBX          0.860391               0.00      0.860391
1578   ZEPP          0.856097               0.02      0.836097
860    LOGC          0.846546               0.05      0.796546
990     NIO          0.811563               0.02      0.791563
1428    TSE          0.775067               0.05      0.725067
930    MODV          0.773322               0.05      0.723322
817    KRNY          0.770282               0.05      0.720282
1545    WNC          0.767113               0.02      0.747113
65     ALUR          0.756362               0.00      0.756362
813    KPTI          0.749644               0.05      0.699644
1316   SRFM          0.743651               0.00      0.743651

Then I ran the smaller Yahoo model:
Training pruned model...
FULL R² -- Train: 0.8181, Validation: 0.3613, Test: 0.2503
PRUNED R² -- Train: 0.8310, Validation: 0.3765, Test: 0.2693
Selected PRUNED model based on test R².
Final Model Test Metrics -- R²: 0.2693, RMSE: 0.4091, MAE: 0.2606
Feature importance summary:
  → Total features evaluated: 30
  → Non-zero importance features: 30

And, the Top 25 report for that one looks like this:
Top 25 Stocks by Final adj_pillar_score:
     symbol  adj_pillar_score  improvement_bonus  pillar_score
907    MOGU          1.532545               0.05      1.482545
1233    SKE          1.345178               0.05      1.295178
1170   RPTX          1.334966               0.02      1.314966
419      DQ          1.305644               0.05      1.255644
1211    SES          1.280886               0.05      1.230886
702    IFRX          1.259426               0.00      1.259426
908    MOLN          1.244191               0.02      1.224191
1161   RLYB          1.237648               0.00      1.237648
176    BHVN          1.232199               0.05      1.182199
512    FEDU          1.218868               0.05      1.168868
977    NPWR          1.205679               0.00      1.205679
1533     YQ          1.204367               0.02      1.184367
11     ABUS          1.201539               0.00      1.201539
58     ALLK          1.192839               0.02      1.172839
1249    SMR          1.154462               0.00      1.154462
63     ALXO          1.148672               0.02      1.128672
1482    WBX          1.140147               0.00      1.140147
987    NUVB          1.139138               0.00      1.139138
1128     QS          1.130001               0.02      1.110001
864     LZM          1.098632               0.00      1.098632
16     ACHR          1.094872               0.02      1.074872
1176    RUN          1.059293               0.02      1.039293
758    JMIA          1.053711               0.00      1.053711
94     ARBK          1.049382               0.00      1.049382
1086   PHVS          1.039269               0.05      0.989269

Symbols that appear in both top 25 lists:

  • RUN (Stockdex rank 1, Yahoo rank 23)

  • LZM (Stockdex rank 2, Yahoo rank 21)

  • ARBK (Stockdex rank 3, Yahoo rank 24)

  • WBX (Stockdex rank 16, Yahoo rank 17)

interesting...

Comparing the top sector reports:

Side-by-side Overlap Analysis Approach

Sector | Symbol(s) (Overlap) | Stockdex Rank & Score | Yahoo Rank & Score | Notes
Basic Materials | LZM, ODV, MAGN | LZM #1 (1.0202), ODV #2 (0.8611), MAGN #4 (0.7287) | LZM #2 (1.0986), ODV #3 (0.9233), MAGN #5 (0.8681) | Close agreement; Yahoo scores higher overall
Communication Services | KUKE, FUBO | KUKE #1 (0.9641), FUBO #4 (0.5936) | KUKE #5 (0.5559), FUBO #4 (0.6544) | Generally consistent rank order
Consumer Cyclical | NIO, UXIN, LOGC | UXIN #1 (0.9612), LOGC #2 (0.8465), NIO #3 (0.8116) | NIO #5 (0.9276); SES #2 (1.2809) not in Stockdex | Partial overlap; Yahoo picks also include SES, MOGU
Consumer Defensive | YQ, OTLY, LOCL | LOCL #1 (0.9359), OTLY #2 (0.8963), YQ #4 (0.6482) | YQ #2 (1.2044), FEDU #1 (1.2189); LOCL missing | Some overlap, differences in top picks
Energy | DWSN, PBF | DWSN #2 (0.6201), PBF #3 (0.4305) | DWSN #1 (0.7613), PBF #3 (0.3556) | Rankings closely aligned
Financial Services | ARBK, MBI, KRNY | ARBK #1 (1.0185), MBI #2 (0.8956), KRNY #3 (0.7703) | ARBK #1 (1.0494), GREE #2 (0.9502); MBI, KRNY missing | Partial overlap
Healthcare | CGC, CCM, TRIB | CGC #1 (1.0091), CCM #2 (0.9822), TRIB #3 (0.9636) | RPTX #1 (1.3350), IFRX #2 (1.2594); CGC etc. missing | Mostly different picks
Industrials | FWRD, EVTL | FWRD #1 (0.9572), EVTL #4 (0.7314) | NPWR #1 (1.2057), EVTL #5 (0.9657) | Some overlap
Real Estate | OPAD, AIV | AIV #1 (0.7303), OPAD #2 (0.7286) | DOUG #1 (0.8116), OPAD #5 (0.5578) | Partial overlap
Technology | RUN, WBX, ZEPP | RUN #1 (1.1565), WBX #2 (0.8604), ZEPP #3 (0.8561) | WBX #2 (1.1401), RUN #3 (1.0593); ZEPP missing | Strong agreement
Utilities | AQN | AQN #1 (0.7382) | OKLO #1 (0.8269); no AQN | Different picks

So - this is excellent model validation, I think. We see some differences due to the amount of time-period data we have, but the results are not wildly different. 

I think I can now use this data in LSTM perhaps. Or whatever my next steps turn out to be, because I may - before LSTM - try to do some earnings transcript parsing for these if it's possible.


AI / ML - Modeling Fundamentals - Mistakes Found and Corrected

After adding Earnings Surprise Score data into my 1 year fwd return predicting model, I kind of felt as though I had hit the end of the road with the model. The Earnings Surprise Score did move the needle. But with all of the effort in Feature Engineering I had put into this model, the only thing I really felt I could add to it was sentiment (news). Given that news is more of a real-time concept, grows stale, and would be relevant for only the latest row of data, I decided to do some final reviews and move on - "graduate" to some new things, like maybe trying out a neural network or doing more current or real-time analysis. In fact, I had already tried a Quarterly model, but the R-squared on it was terrible and I decided not to use it - not even to ensemble it with the annual report data model.

So - I asked a few different LLMs to code review my model. And I was horrified to learn that because of using LLMs to continually tweak my model, I had wound up with issues related to "Snippet Integration". 

Specifically, I had some major issues:

1. Train/Test Split Happened Too Early or Multiple Times

  •  Splitting data before full preprocessing (e.g., before feature scaling, imputation, or log-transforming the target).
  •  Redundant train/test splits defined in multiple places — some commented, some active — leading to potential inconsistencies depending on which was used.


2. No Validation Set

  •  Originally, data was split into training and test sets.
    •  This meant that model tuning (e.g. SHAP threshold, hyperparameter selection) was inadvertently leaking test set information. 

  •  Now corrected with a clean train/val/test split.
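The clean three-way split is just two chained train_test_split calls. A sketch with toy arrays and integer split sizes for determinism (in practice you'd use fractions, e.g. a 70/15/15 split):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 2 features
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# First carve off a held-out test set...
X_trval, X_test, y_trval, y_test = train_test_split(
    X, y, test_size=15, random_state=42
)
# ...then split the remainder into train and validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_trval, y_trval, test_size=15, random_state=42
)
print(len(X_train), len(X_val), len(X_test))  # 70 15 15
```

All tuning decisions (SHAP thresholds, hyperparameters) are made against the validation set; the test set is touched exactly once at the end.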


3. Inconsistent Preprocessing Between Train and Test

  •  Preprocessing steps like imputation, outlier clipping, or scaling were not always applied after the split.
  •  This risked information bleeding from test → train, violating standard ML practice.
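The fix is to fit every preprocessing statistic on the training split only, then transform the other splits with those same fitted objects. A minimal sklearn sketch (toy arrays, not the model's actual pipeline):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0]])
X_test = np.array([[4.0, np.nan]])

# Fit statistics (medians, means, stds) on TRAIN only...
imputer = SimpleImputer(strategy='median').fit(X_train)
scaler = StandardScaler().fit(imputer.transform(X_train))

# ...then apply (transform) to both splits with the same fitted objects.
X_train_p = scaler.transform(imputer.transform(X_train))
X_test_p = scaler.transform(imputer.transform(X_test))
print(X_test_p)
```

The test row's NaN gets the train median (not a statistic computed over test data), so nothing bleeds from test into the fitted preprocessing.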


4. Improper Handling of Invalid Target Values (fwdreturn)

  •  NaN, inf, and unrealistic values (like ≤ -100%) were not being consistently filtered.
  •  This led to silent corruption of both training and evaluation scores.
  •  Now fixed with a strict invalid_mask and logging/reporting of dropped rows.
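A sketch of what such a strict invalid_mask can look like (toy data; the -100% floor means any fwdreturn at or below -1.0 is treated as invalid):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'fwdreturn': [0.12, np.nan, np.inf, -1.5, -0.30]})

# Strict mask: NaN, +/-inf, or a loss of 100% or worse are all invalid.
invalid_mask = (
    df['fwdreturn'].isna()
    | np.isinf(df['fwdreturn'])
    | (df['fwdreturn'] <= -1.0)
)
print(f"Dropping {invalid_mask.sum()} of {len(df)} rows with invalid fwdreturn")
df_clean = df[~invalid_mask]
```

Logging the drop count is the important part - silent row loss is exactly how the corruption went unnoticed before.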


5. Redundant or Conflicting Feature Definitions

  •  There were multiple, overlapping blocks like:

                features = [col for col in df.columns if ...]
                X = df[features]

  •  This made it unclear which feature list was actually being used.
  •  Sector z-scored and raw versions were sometimes duplicated or mixed without clarity.


6. Scaling and Z-Scoring Logic Was Not Modular or Controlled

  •  Originally, some features were being z-scored after asset-scaling (which didn’t make sense).
  •  Some metrics were scaled both to assets and z-scored sector-wise, which polluted the modeling signal.
  •  Now addressed with clear separation and feature naming conventions.


7. SHAP Was Applied to a Noisy or Unclean Feature Space

  •  Without proper pruning first (e.g., dropping all-NaN columns), SHAP feature importance included irrelevant or broken columns.
    •  This could inflate feature count or misguide model interpretation.
  •  Now resolved by cleaning feature set before SHAP and applying SHAP-based selection only on valid, imputed data.
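Once you have a mean |SHAP| value per feature (the kind of numbers shown in the overview above), the selection itself is just a threshold filter. A sketch with hypothetical values and an assumed cutoff - in the real pipeline the values come from shap.TreeExplainer on the trained model:

```python
import numpy as np

feature_names = ['dilutedEPS', 'roe', 'quick_ratio', 'current_ratio_sector_z']
# Hypothetical mean(|SHAP|) per feature; in the pipeline these come from
# averaging shap.TreeExplainer(model).shap_values(X_val) over rows.
mean_abs_shap = np.array([0.02566, 0.01071, 0.00000, 0.00000])

SHAP_THRESHOLD = 0.007  # assumed cutoff, not the model's actual setting
keep = mean_abs_shap >= SHAP_THRESHOLD
final_features = [f for f, k in zip(feature_names, keep) if k]
print(final_features)  # ['dilutedEPS', 'roe']
```

The cleanup matters because an all-NaN or broken column can still draw nonzero attributions, polluting exactly this ranking.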
One other issue was that the modeling code was in the main() function, which had gotten way too long and hard to read. This function had all of the training, testing, splitting/pruning, the model fitting, AND all of the scoring. I broke out the train/validate/test process and put it into play for both the full model and the SHAP-pruned model. Then I took the best R-squared from both runs and used the winning model.
