Monday, September 22, 2025

Target Encoding & One-Hot Encoding

I had completely overlooked these concepts earlier on; they didn't come up until just now - this late in the game - as I was reading some of the advanced sections of the Jansen Algorithmic Trading book.

When you run a model, there are always going to be features you do NOT want to include, for many reasons (circular correlation, redundancy, etc.). Generally, you would not want to include symbol, sector, and industry in your model, because the symbol, after all, is nothing more than an index in the data - and the symbol in and of itself should have NOTHING to do with predicting returns.

So we keep an Exclusion array of metrics we don't want included:
EXCLUDED_FEATURES = [
    # 'symbol', 
    'date', 'year', 'quarter', 'close', 'fwdreturn',  # identity + target
    # 'sector', 'industry',  optional: include via one-hot or embeddings if desired
...
]

Symbol, Sector and Industry were initially excluded because they are text strings, and models don't like text data (some won't run on anything but numerics, some give warnings, and others produce unpredictable results if they tolerate it at all).

Then I read about target encodings for symbols - which allow you to KEEP symbols in the data, and something called "one-hot" encodings for things like sectors and industries.

What these encodings do is take categorical data - like sectors and industries - and convert it to numerics so that the model will consider it. This can capture the association between category and target, sometimes boosting model power.

So - adding the code to do this wasn't terribly painful. The symbol was encoded as a single target-encoded feature (symbol_enc) - and removed from the above-mentioned EXCLUDED_FEATURES list. The code to one-hot encode the Sector and Industry was snapped in, and off we went...
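For anyone curious what this looks like in code, here's a minimal pandas sketch of the two techniques, using a tiny hypothetical frame (the column names match my data; the values are made up). Note the target encoding here is a plain per-symbol mean with no smoothing or out-of-fold protection - in a real pipeline you'd compute it on the train split only, to avoid leakage.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are hypothetical)
df = pd.DataFrame({
    'symbol': ['AAPL', 'AAPL', 'MSFT', 'MSFT', 'XOM'],
    'sector': ['Technology', 'Technology', 'Technology', 'Technology', 'Energy'],
    'fwdreturn': [0.10, 0.14, 0.08, 0.06, -0.02],
})

# Target encoding: replace each symbol with that symbol's mean forward
# return (in the real pipeline, compute this on the TRAIN split only).
symbol_means = df.groupby('symbol')['fwdreturn'].mean()
df['symbol_enc'] = df['symbol'].map(symbol_means)

# One-hot encoding: one 0/1 column per sector category.
df = pd.get_dummies(df, columns=['sector'], prefix='sector')
print(df.columns.tolist())
```

The one-hot step is also exactly why industries blow up the feature count: get_dummies emits one new column per distinct category value.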

The R-squared dropped from .11 to .06. Yikes...
🪓  Pruning all splits - train, val, test - to the pruned set of features (final_features).
Training pruned model...
FULL R² -- Train: 0.4989, Validation: 0.0776, Test: 0.0673
PRUNED R² -- Train: 0.2216, Validation: 0.0348, Test: 0.0496
Selected FULL model based on test R².
Previous best test R²: 0.1126
Current model test R²: 0.0673
Current model did NOT exceed best. Loading previous best model and features.
Final Model Test Metrics -- R²: 0.1126, RMSE: 0.2133, MAE: 0.1471


I quickly realized that by one-hot encoding the Industry, we got so many new feature columns - one per industry - that it undoubtedly screwed up the model.

I decided to run again, and only one-hot encode the Sector. So now with this run, we have a target encoding for symbol, and a one-hot encoding for sector. There are not that many sectors, so this doesn't become too unwieldy.

But - here is what we got on our SHAP analysis after this change:
🧮 Feature Selection Overview:
✅  consumer_sentiment_×_earnings_yield                SHAP=0.04511
✅  dilutedEPS                                         SHAP=0.02566
✅  rev_grw_4q                                         SHAP=0.01191
✅  evToSales                                          SHAP=0.01181
✅  roe                                                SHAP=0.01071
✅  symbol_enc                                         SHAP=0.01055
✅  business_confidence_×_capexToRevenue               SHAP=0.00925
✅  treasury_spread_x_debt_to_equity                   SHAP=0.00867
✅  cpippi_marginsqueeze                               SHAP=0.00719
✅  earningsYield                                      SHAP=0.00703
✅  inflation_x_debt_to_equity                         SHAP=0.00701
⛔  ppi_x_capextorevenue                               SHAP=0.00628
⛔  operatingCashFlow_to_totalAssets                   SHAP=0.00610
⛔  momentum_1y                                        SHAP=0.00586
⛔  gross_margin                                       SHAP=0.00575
⛔  realized_vol_3m                                    SHAP=0.00543
⛔  long_term_debt_to_equity                           SHAP=0.00526
⛔  rev_grw_qoq                                        SHAP=0.00500
⛔  vix_×_debt_to_equity                               SHAP=0.00480
⛔  goodwill_to_totalAssets                            SHAP=0.00432
⛔  cpi_x_netprofitmargin                              SHAP=0.00371
⛔  incomeQuality                                      SHAP=0.00360
⛔  rev_grw_qoq_to_totalAssets                         SHAP=0.00341
⛔  netDebtToEBITDA                                    SHAP=0.00340
⛔  evToFreeCashFlow                                   SHAP=0.00304
⛔  debt_to_equity                                     SHAP=0.00303
⛔  salesGeneralAndAdministrativeToRevenue             SHAP=0.00288
⛔  vix_x_evtoebitda                                   SHAP=0.00273
⛔  rev_grw_pop_sector_z                               SHAP=0.00268
⛔  eps_grw_qoq                                        SHAP=0.00258
⛔  evToEBITDA                                         SHAP=0.00258
⛔  interestBurden                                     SHAP=0.00255
⛔  momentum_1y_sector_z                               SHAP=0.00252
⛔  evToEBITDA_sector_z                                SHAP=0.00237
⛔  asset_turnover                                     SHAP=0.00235
⛔  totalEquity                                        SHAP=0.00231
⛔  net_margin                                         SHAP=0.00228
⛔  workingCapital                                     SHAP=0.00218
⛔  cc_delinquency_rate_×_debt_ratio                   SHAP=0.00209
⛔  inventory_turnover                                 SHAP=0.00209
⛔  capexToRevenue                                     SHAP=0.00197
⛔  freeCashFlowPerShare                               SHAP=0.00191
⛔  daysOfInventoryOutstanding_sector_z                SHAP=0.00186
⛔  fcf_ps_grw_qoq                                     SHAP=0.00186
⛔  eps_grw_qoq_to_totalAssets                         SHAP=0.00182
⛔  daysOfInventoryOutstanding                         SHAP=0.00180
⛔  daysOfPayablesOutstanding                          SHAP=0.00179
⛔  capexToRevenue_sector_z                            SHAP=0.00179
⛔  operatingCashFlow                                  SHAP=0.00173
⛔  totalEquity_to_totalAssets                         SHAP=0.00163
⛔  stockBasedCompensationToRevenue                    SHAP=0.00159
⛔  cash_ratio                                         SHAP=0.00158
⛔  evToOperatingCashFlow                              SHAP=0.00156
⛔  debt_ratio                                         SHAP=0.00156
⛔  eps_grw_4q                                         SHAP=0.00155
⛔  receivables_turnover                               SHAP=0.00153
⛔  ocf_to_current_liabilities                         SHAP=0.00151
⛔  totalDebt                                          SHAP=0.00146
⛔  operating_margin                                   SHAP=0.00143
⛔  debt_ratio_sector_z                                SHAP=0.00143
⛔  workingCapital_to_totalAssets                      SHAP=0.00135
⛔  operatingReturnOnAssets                            SHAP=0.00130
⛔  daysOfSalesOutstanding                             SHAP=0.00129
⛔  ordinary_shares                                    SHAP=0.00128
⛔  roa                                                SHAP=0.00127
⛔  earnings_surprise                                  SHAP=0.00122
⛔  bookValue                                          SHAP=0.00118
⛔  totalLiabilities                                   SHAP=0.00114
⛔  fcf_ps_grw_qoq_to_totalAssets                      SHAP=0.00113
⛔  goodwill                                           SHAP=0.00112
⛔  fcf_ps_grw_4q                                      SHAP=0.00106
⛔  eps_grw_pop_sector_z                               SHAP=0.00100
⛔  freeCashFlow                                       SHAP=0.00090
⛔  current_ratio                                      SHAP=0.00090
⛔  ebitda_margin                                      SHAP=0.00087
⛔  totalRevenue                                       SHAP=0.00082
⛔  quick_ratio_sector_z                               SHAP=0.00078
⛔  free_cash_flow                                     SHAP=0.00073
⛔  ocf_to_total_liabilities                           SHAP=0.00062
⛔  sector_Energy                                      SHAP=0.00062
⛔  returnOnTangibleAssets                             SHAP=0.00059
⛔  avgDilutedShares                                   SHAP=0.00050
⛔  sector_Basic Materials                             SHAP=0.00034
⛔  sector_Technology                                  SHAP=0.00025
⛔  sector_Real Estate                                 SHAP=0.00025
⛔  sector_Financial Services                          SHAP=0.00019
⛔  sector_Industrials                                 SHAP=0.00011
⛔  sector_Consumer Cyclical                           SHAP=0.00010
⛔  sector_Communication Services                      SHAP=0.00009
⛔  sector_Healthcare                                  SHAP=0.00007
⛔  sector_Utilities                                   SHAP=0.00004
⛔  sector_Consumer Defensive                          SHAP=0.00002
⛔  fcf_ps_grw_pop                                     SHAP=0.00000
⛔  rev_grw_pop                                        SHAP=0.00000
⛔  eps_grw_pop                                        SHAP=0.00000
⛔  unemp_x_rev_grw                                    SHAP=0.00000
⛔  ocf_to_net_income                                  SHAP=0.00000
⛔  quick_ratio                                        SHAP=0.00000
⛔  current_ratio_sector_z                             SHAP=0.00000

The symbol target encoding? SHAP likes it. It moves the proverbial needle.

But when it comes to the sector and industry one-hot encodings, neither seems to be adding any real benefit.

Our R-squared was better than the one we had with one-hot encoded industries - but less than our previous and best runs with NO encodings as shown below.

Symbol Target-Encoded and Sector One-Hot Encoded:
🪓  Pruning all splits - train, val, test - to the pruned set of features (final_features).
Training pruned model...
FULL R² -- Train: 0.4708, Validation: 0.0757, Test: 0.0868
PRUNED R² -- Train: 0.3847, Validation: 0.0779, Test: 0.0486
Selected FULL model based on test R².
Previous best test R²: 0.1126
Current model test R²: 0.0868
Current model did NOT exceed best. Loading previous best model and features.
Final Model Test Metrics -- R²: 0.1126, RMSE: 0.2154, MAE: 0.1485
Feature importance summary:
  → Total features evaluated: 78

So - one last run, with only the Symbol Target Encoding (not Sector and Industry)
🪓  Pruning all splits - train, val, test - to the pruned set of features (final_features).
Training pruned model...
FULL R² -- Train: 0.5169, Validation: 0.0788, Test: 0.0463
PRUNED R² -- Train: 0.2876, Validation: 0.0456, Test: 0.0083
Selected FULL model based on test R².
Previous best test R²: 0.1126
Current model test R²: 0.0463
Current model did NOT exceed best. Loading previous best model and features.
Final Model Test Metrics -- R²: 0.1126, RMSE: 0.2154, MAE: 0.1479
Feature importance summary:
  → Total features evaluated: 78

And...back to no symbol, sector, industry
🪓  Pruning all splits - train, val, test - to the pruned set of features (final_features).
Training pruned model...
FULL R² -- Train: 0.4876, Validation: 0.1095, Test: 0.0892
PRUNED R² -- Train: 0.2870, Validation: 0.0713, Test: 0.0547
Selected FULL model based on test R².
Previous best test R²: 0.1126
Current model test R²: 0.0892
Current model did NOT exceed best. Loading previous best model and features.
Final Model Test Metrics -- R²: 0.1126, RMSE: 0.2188, MAE: 0.1486

Once again - no encoding at all on symbol, sector and industry...
🪓  Pruning all splits - train, val, test - to the pruned set of features (final_features).
Training pruned model...
FULL R² -- Train: 0.4985, Validation: 0.1051, Test: 0.0969
PRUNED R² -- Train: 0.2742, Validation: 0.0981, Test: 0.0869
Selected FULL model based on test R².
Previous best test R²: 0.1126
Current model test R²: 0.0969
Current model did NOT exceed best. Loading previous best model and features.
Final Model Test Metrics -- R²: 0.1126, RMSE: 0.2160, MAE: 0.1487
Feature importance summary:
  → Total features evaluated: 78

Conclusion: These encodings did not improve the R-squared - in fact, they moved it backwards. Unless I can find a configuration where the encoding adds real benefit, I will leave those features out, as the model was running originally.

I Have More Data Now - Enough for an AI RNN LSTM Model?

I have a LOT more data now than I did before. And an advanced architecture to process it.

Should I consider an RNN?

I knew I couldn't really pull it off with the Annual data I had - because by the time you split the data for training, validation, and testing there isn't enough to feed the algorithm.  

But - Quarterly! I have a LOT of quarterly data now, many statements per symbol across quarter-dates. ~70K rows of data!!!

So let's try doing an LSTM....I wrote a standalone LSTM, using Keras. Just a few lines of code. 

One important note about this! 

Do NOT mix your data processing, and or your XGBoost code, with neural network code!!! ALWAYS create a brand new virtual environment for your neural RNN code, because if you choose Keras or the other competing frameworks, they will require specific versions of Python libraries that may conflict with your data processing and/or XGBoost libraries!

Now, with that important disclaimer, a small sample of the code. We will highlight in blue since my blog tool apparently has no code block format.

# -------------------------
# Imports (the snippet needs these to run)
# -------------------------
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping

# X has shape (samples, SEQ_LEN, n_features); y is the forward return.
# SEQ_LEN, TEST_SIZE, RANDOM_STATE, feature_cols are defined elsewhere.

# -------------------------
# Train/test split
# -------------------------
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=TEST_SIZE, random_state=RANDOM_STATE
)

# -------------------------
# Build LSTM model
# -------------------------
model = Sequential()
model.add(LSTM(32, input_shape=(SEQ_LEN, len(feature_cols)), return_sequences=False))
model.add(Dropout(0.2))
model.add(Dense(16, activation='relu'))
model.add(Dense(1))  # regression output

model.compile(optimizer='adam', loss='mse')

# Early stopping
es = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

# -------------------------
# Train
# -------------------------
history = model.fit(
    X_train, y_train,
    validation_split=0.1,
    epochs=50,
    batch_size=16,
    callbacks=[es],
    verbose=1
)

# -------------------------
# Evaluate
# -------------------------
y_pred = model.predict(X_test).flatten()
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
print(f"R²: {r2:.3f}, MAE: {mae:.3f}")
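One thing the snippet above glosses over: X has to already be shaped (samples, SEQ_LEN, n_features). Building those sequences out of per-symbol quarterly rows is a step of its own; here's a sketch (the function and toy data are hypothetical, not my actual pipeline code) that windows each symbol's chronologically sorted quarters:

```python
import numpy as np
import pandas as pd

SEQ_LEN = 4  # quarters per input sequence (assumed)

def build_sequences(df, feature_cols, target_col='fwdreturn', seq_len=SEQ_LEN):
    """Window each symbol's chronologically sorted rows into
    (seq_len, n_features) blocks; the target is the window's last row."""
    xs, ys = [], []
    for _, g in df.sort_values('date').groupby('symbol'):
        feats = g[feature_cols].to_numpy(dtype='float32')
        targets = g[target_col].to_numpy(dtype='float32')
        for end in range(seq_len, len(g) + 1):
            xs.append(feats[end - seq_len:end])
            ys.append(targets[end - 1])
    return np.stack(xs), np.array(ys)

# Toy data: two symbols, six quarters each (dates as simple ordinals)
df = pd.DataFrame({
    'symbol': ['A'] * 6 + ['B'] * 6,
    'date': list(range(6)) * 2,
    'roe': np.arange(12, dtype='float32'),
    'fwdreturn': np.linspace(-0.1, 0.1, 12).astype('float32'),
})
X, y = build_sequences(df, ['roe'])
print(X.shape, y.shape)  # (6, 4, 1) (6,)
```

Each symbol with six quarters yields three overlapping 4-quarter windows, so the toy frame produces six sequences in total.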

Well, how did it go?


The previous XGBoost R-squared value was consistently .11-.12. Now, we are getting .17-.19. This is a significant improvement!
 

Changing the Ensemble Model to a Stacked Meta Ensemble

 
Earlier we had a weighted ensemble model that essentially took the R-squared values of Annual and Quarterly and used them as weighting factors to combine the two.

It was here that I realized we were not calculating or saving the predicted forward return - we were only calculating scores, writing them to a scoring summary, and saving the R-squared.

So I changed things around. I added a stacked meta ensemble, and will describe how these work below. We now run BOTH of these.

Weighted Ensemble

  • A simple blend of the two base models.
  •  Annual and quarterly predictions are combined with weights proportional to their out-of-sample R² performance.

Result: ensemble_pred_fwdreturn and ensemble_pred_fwdreturn_pct.

This improves stability but is still fairly “rigid.”
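The blend itself is only a few lines. A sketch, with hypothetical R² values and prediction arrays (a negative R² is clipped to zero so a failing base model gets no weight):

```python
import numpy as np

# Hypothetical out-of-sample R² for each base model
r2_annual, r2_quarterly = 0.11, 0.17

# Weights proportional to R², clipped at zero and normalized to sum to 1
w_a = max(r2_annual, 0.0)
w_q = max(r2_quarterly, 0.0)
w_a, w_q = w_a / (w_a + w_q), w_q / (w_a + w_q)

# Toy base-model predictions (hypothetical values)
annual_pred = np.array([0.05, -0.02, 0.10])
quarterly_pred = np.array([0.07, 0.01, 0.04])

ensemble_pred_fwdreturn = w_a * annual_pred + w_q * quarterly_pred
print(ensemble_pred_fwdreturn)
```

The "rigid" part is plain to see: the weights are fixed scalars, the same for every row regardless of sector, regime, or anything else.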


Meta-Model Ensemble (Stacked Ensemble)

A second-level model (XGBoost) is trained on:

  1. Predictions from the annual model
  2. Predictions from the quarterly model
  3. Additional features (sector, industry, etc.)

This meta-model learns the optimal way to combine signals dynamically rather than relying on fixed weights.

Result: ensemble_pred_fwdreturn_meta and ensemble_pred_fwdreturn_meta_pct.
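A sketch of the stacking idea. The real meta-model is XGBoost; I'm using sklearn's GradientBoostingRegressor here as a stand-in so the snippet runs without xgboost installed, and the inputs are synthetic - in the real pipeline the base-model predictions fed to the meta-model should be out-of-fold to avoid leakage:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n = 400

# Synthetic stand-ins for the base models' predictions; in the real
# pipeline these should be OUT-OF-FOLD predictions to avoid leakage.
true_ret = rng.normal(0, 0.2, n)
annual_pred = true_ret + rng.normal(0, 0.15, n)
quarterly_pred = true_ret + rng.normal(0, 0.10, n)
extra_feat = rng.normal(0, 1, n)  # stands in for sector/industry features

X_meta = np.column_stack([annual_pred, quarterly_pred, extra_feat])

# Second-level model learns how to combine the base signals dynamically.
meta = GradientBoostingRegressor(random_state=0)
meta.fit(X_meta, true_ret)
ensemble_pred_fwdreturn_meta = meta.predict(X_meta)
print(round(r2_score(true_ret, ensemble_pred_fwdreturn_meta), 3))
```

Unlike the fixed-weight blend, the meta-model can learn to lean on whichever base signal is more reliable in different regions of the feature space.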

How well did it work?
Results

  1. Weighted Ensemble: R² ~0.19, Spearman ~0.50
  2. Meta-Model Ensemble: R² ~0.75, Spearman ~0.65

Quintile backtests confirm a strong monotonic relationship between predicted quintiles and realized forward returns.
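A quintile backtest is easy to sketch with pandas: bucket the predictions into five groups and compare the realized mean forward return per bucket (synthetic data here, not my actual results):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 1000
pred = rng.normal(0, 1, n)
# Realized returns loosely correlated with predictions (synthetic)
realized = 0.05 * pred + rng.normal(0, 0.1, n)

df = pd.DataFrame({'pred': pred, 'fwdreturn': realized})
# Bucket predictions into quintiles (Q1 = lowest predicted return)
df['quintile'] = pd.qcut(df['pred'], 5, labels=['Q1', 'Q2', 'Q3', 'Q4', 'Q5'])
by_q = df.groupby('quintile', observed=True)['fwdreturn'].mean()
print(by_q)
```

If the model has real signal, the mean realized return should step up roughly monotonically from Q1 to Q5 - that's the pattern the backtests confirmed.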

Friday, September 5, 2025

I Need More Financial Quant Data - Techniques On How To Get It

I may have posted earlier about how finding enough data - for free - is extreeeemely difficult.

Even if you can find it, ensuring its integrity can cost time, money, and cycles - so much so that it's simpler to just let someone else deal with it and subscribe. Problem is, I am "subscriptioned out". I can't keep adding layer upon layer of subscriptions, because that money adds up.

So - I work hard to see what data is available out there (e.g. Kaggle). It makes no sense to waste processing cycles and bandwidth if someone has already cultivated that data and is willing to share it.

I also have learned that there are a lot of bots out there that screen-scrape, using tools like Beautiful Soup. And if you are clever enough to use layers of (secure - I can't stress that enough) proxies, and morph your digital fingerprint (i.e. changing up browser headers and such), you can go out there and find data, and save it - and even check the integrity of the data by checking it against a couple or three sources. 

And don't forget rate-limiting and Cloudflare tools - you have to figure out how to evade those as well. It's a chess game, and one that seemingly never ends.

Anyway - I decided I needed quarterly data. My XGBoost model just wasn't performing the way I wanted. I added more interactive features from macro data, and even a "graph score" (see earlier posts). And indeed the score - the R-squared - came up, but it didn't get to where I wanted it, and the list of stock picks was not made up of stocks I would personally invest in.

I decided to do two things:

  1. Find superior data source(s) - preferably where I could get more and better quarterly data - for free.
  2. Consolidate the code so that I didn't have to manage and sync code that was fetching on one frequency (annual) vs another.

I underestimated these tasks. Greatly.

I found a Github project that could hit different data sources. It had an OO design - and was probably over-engineered IMHO. But, I got what the author was after - by using a base class and then plugging in different "interface classes", you could maybe switch back and forth between different data sources. 

So I tried it. And, lo and behold, it didn't work. At first it did - for annual statements. But after I downloaded about 8k quarterly statements, I was horrified to realize that all of the quarterly statements were clones of the annual statements. Wow, what a waste!!!

I checked - and the quarterly data was there indeed. The Github code was flawed. So, I fixed it. And even enhanced it. 

This is the first time I have actually contributed to a community Github project. I am familiar with Git and Github, but if you are not doing this kind of thing on the regular, you have to re-learn topics such as branch development, Pull Requests, Merges, etc. And perhaps one of the most annoying things, is that the upstream owner of the repository may not like or agree with your changes.  

In this particular case, the repo owner was using property decorators. Those work fine until your methods need parameters - a @property is referenced like an attribute, so calls that pass parameters simply don't work. I had to blow those out. He didn't seem happy about it, but eventually he seemed to acknowledge the need. Another difference of opinion had to do with his use of an lru_cache decorator on calls. I wasn't up to speed on this, so I read up on it and concluded that this was NOT the right situation for caching, let alone LRU caching. It can speed things up TREMENDOUSLY in the right use cases - but if you are batch-downloading thousands of statements for thousands of symbols, you are never going to consult the cache for the same symbol twice, so a cache like that actually creates overhead - and risk (i.e. running out of resources like memory if you don't set a max size on the cache).
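The overhead is easy to demonstrate: with an unbounded lru_cache, a batch job over unique symbols gets zero hits while the cache grows one entry per call (the function below is a stand-in for the real downloader):

```python
from functools import lru_cache

@lru_cache(maxsize=None)  # unbounded: every result stays in memory forever
def fetch_statement(symbol: str, period: str) -> str:
    # Stand-in for a network download of one financial statement
    return f"{symbol}:{period}"

# Batch-downloading thousands of UNIQUE symbols means every call is a
# cache miss -- the cache adds bookkeeping and memory but never pays off.
for i in range(5000):
    fetch_statement(f"SYM{i}", "quarterly")

info = fetch_statement.cache_info()
print(info)  # hits=0, misses=5000, currsize=5000
```

Zero hits, 5,000 retained entries: pure cost. A cache only earns its keep when the same arguments recur.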

In the end, I have some code that works. I had to do a rebase and update the pull request, and if he doesn't take these changes the way I wrote them and need them, I guess I can always just create my own repo and go solo on this. I would rather not, because the repo owner does sync his repository with the pip installer, which makes it easy to download and update.

  

Thursday, August 28, 2025

Latest Changes to XGBoost Quant Financial Model

 

Latest Changes:

Added new macro features to my model - CPI, PPI, VIX (the last change I made was adding a beats/meets/misses surprise score a few weeks ago)

I added some interactive features based on these (5 in total). I have learned that these interactives move the model predictability like nothing else - which is why I added more. 

    R-squared score came back up to approaching .3 now with these.

    My correlation is still inverted from 1 yr fwd return - even more so. So I flip the score.

One major change was that I added a "graph_score" which was a nightmare to produce. LLMs can NOT seem to handle this task AT ALL. So I finally had to do it mostly myself, and got something working fairly well - it recognizes good graphs vs bad graphs and downscores bad graph patterns.

I forked the stockdex Github project, and am making some changes that I will submit back with a pull request. The macrotrends data source could not handle quarterly data. Once I realized it could be done, I decided to enhance the code to do this so I could run the model and get more quarterly statements alongside the pack of annual statements.

Once I get the model run with annual+quarterly, I will probably retire this project and move onto more recent and LLM-based stuff.

Friday, August 1, 2025

AI / ML - Data Source Comparison with More Data

"To validate generalizability, I ran the same model architecture against two datasets: a limited Yahoo-based dataset and a deeper Stockdex/Macrotrends dataset. Not only did the model trained on Stockdex data achieve a higher R² score (benefiting from more years of data), but the feature importances and pillar scores remained largely stable between the two. This consistency reinforces the robustness of the feature selection process and confirms that the model is learning persistent financial patterns, not artifacts of a specific data source or time window."

Details:

I wanted to move to an LSTM model to predict stock returns, but I was fortunate and patient enough to really read up and plan this before just diving into the deep end of the pool.

I learned that the "true AI" models - Transformers, RNNs (of which LSTM is a subclass), et al - require more data. I didn't have anywhere near enough data using Yahoo, which gives 4-5 years (at best) of data. And because I was calculating momentum, yoy growth and such, I would always lose one of the years (rows) right off the bat - a significant percentage of already-scarce data.

So, in digging around, I found a Python library called stockdex. It is architected to be able to use multiple data sets, but the default is macrotrends. 

But using this library and source left several challenges:

  1. No quarterly data in the Python API - although the website does have a "Format" drop down for Quarterly and Annual.
  2. The data was pivoted relative to Yahoo's. Yahoo puts the line items in columns (x) and the time periods in rows (y); the stockdex API downloaded it the opposite way. 
  3. The stockdex had no "names" for the items. 

Ultimately, I decided to use this because it returned a LOT more years of data.

  1. First, I put some code together to download the raw data (statements), and then "pivot" the data to match Yahoo's format. 
  2. Then, I used a mapping approach to change the columns from Macrotrends to Yahoo - so that I didn't have to change my logic that parsed statements.
  3. I did have to do a run-tweak on the Metrics and Ratios, and fix certain columns that were not coming in correctly.
  4. Lastly, I ran the model - same one as Yahoo and was able to keep the model logic essentially unchanged. 
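Steps 1 and 2 boil down to a transpose plus a column-name mapping. A sketch with a toy frame - the two mapping entries are illustrative, not the real (much longer) Macrotrends-to-Yahoo map:

```python
import pandas as pd

# stockdex/macrotrends orientation (as downloaded): line items on the
# rows, time periods on the columns -- the opposite of Yahoo's layout.
raw = pd.DataFrame(
    {'2023-Q4': [100.0, 40.0], '2024-Q1': [110.0, 44.0]},
    index=['revenue', 'cost-of-goods-sold'],
)

# Step 1: pivot (transpose) so periods become rows and items columns.
pivoted = raw.T

# Step 2: map Macrotrends item names onto the Yahoo names the existing
# statement-parsing logic expects (this two-entry map is illustrative).
MACRO_TO_YAHOO = {'revenue': 'totalRevenue', 'cost-of-goods-sold': 'costOfRevenue'}
statements = pivoted.rename(columns=MACRO_TO_YAHOO)
print(statements)
```

With the orientation and names normalized, the downstream parsing code never needs to know which source the statement came from.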

The training took a LOT longer on Stockdex. The combined train+val had 14,436 rows on it.
Here is what we got:
FULL R² -- Train: 0.8277, Validation: 0.3441, Test: 0.3537
PRUNED R² -- Train: 0.7714, Validation: 0.3146, Test: 0.3429
Selected FULL model based on test R².
Final Model Test Metrics -- R²: 0.3537, RMSE: 0.3315, MAE: 0.2282
Feature importance summary:
  → Total features evaluated: 79
  → Non-zero importance features: 75

Running and scoring the model took a very, very long time. Finally it came out with this Top 25 list.
Top 25 Stocks by Final adj_pillar_score:
     symbol  adj_pillar_score  improvement_bonus  pillar_score
1209    RUN          1.156474               0.05      1.106474
884     LZM          1.020226               0.05      0.970226
97     ARBK          1.018518               0.00      1.018518
277     CGC          1.009068               0.02      0.989068
262     CCM          0.982228               0.02      0.962228
821    KUKE          0.964131               0.00      0.964131
1415   TRIB          0.963591               0.02      0.943591
1473   UXIN          0.961206               0.05      0.911206
571    FWRD          0.957156               0.05      0.907156
859    LOCL          0.935929               0.00      0.935929
1069   OTLY          0.896289               0.00      0.896289
894     MBI          0.895565               0.05      0.845565
1159   QDEL          0.890248               0.05      0.840248
1039    ODV          0.861127               0.00      0.861127
1522    WBX          0.860391               0.00      0.860391
1578   ZEPP          0.856097               0.02      0.836097
860    LOGC          0.846546               0.05      0.796546
990     NIO          0.811563               0.02      0.791563
1428    TSE          0.775067               0.05      0.725067
930    MODV          0.773322               0.05      0.723322
817    KRNY          0.770282               0.05      0.720282
1545    WNC          0.767113               0.02      0.747113
65     ALUR          0.756362               0.00      0.756362
813    KPTI          0.749644               0.05      0.699644
1316   SRFM          0.743651               0.00      0.743651

Then I ran the smaller Yahoo model:
Training pruned model...
FULL R² -- Train: 0.8181, Validation: 0.3613, Test: 0.2503
PRUNED R² -- Train: 0.8310, Validation: 0.3765, Test: 0.2693
Selected PRUNED model based on test R².
Final Model Test Metrics -- R²: 0.2693, RMSE: 0.4091, MAE: 0.2606
Feature importance summary:
  → Total features evaluated: 30
  → Non-zero importance features: 30

And, the Top 25 report for that one looks like this:
Top 25 Stocks by Final adj_pillar_score:
     symbol  adj_pillar_score  improvement_bonus  pillar_score
907    MOGU          1.532545               0.05      1.482545
1233    SKE          1.345178               0.05      1.295178
1170   RPTX          1.334966               0.02      1.314966
419      DQ          1.305644               0.05      1.255644
1211    SES          1.280886               0.05      1.230886
702    IFRX          1.259426               0.00      1.259426
908    MOLN          1.244191               0.02      1.224191
1161   RLYB          1.237648               0.00      1.237648
176    BHVN          1.232199               0.05      1.182199
512    FEDU          1.218868               0.05      1.168868
977    NPWR          1.205679               0.00      1.205679
1533     YQ          1.204367               0.02      1.184367
11     ABUS          1.201539               0.00      1.201539
58     ALLK          1.192839               0.02      1.172839
1249    SMR          1.154462               0.00      1.154462
63     ALXO          1.148672               0.02      1.128672
1482    WBX          1.140147               0.00      1.140147
987    NUVB          1.139138               0.00      1.139138
1128     QS          1.130001               0.02      1.110001
864     LZM          1.098632               0.00      1.098632
16     ACHR          1.094872               0.02      1.074872
1176    RUN          1.059293               0.02      1.039293
758    JMIA          1.053711               0.00      1.053711
94     ARBK          1.049382               0.00      1.049382
1086   PHVS          1.039269               0.05      0.989269

Symbols that appear in both top 25 lists:

  • RUN (Stockdex rank 1, Yahoo rank 23)

  • LZM (Stockdex rank 2, Yahoo rank 21)

  • ARBK (Stockdex rank 3, Yahoo rank 24)

  • WBX (Stockdex rank 16, Yahoo rank 17)

interesting...

Comparing the top sector reports:

Side-by-side Overlap Analysis Approach

Sector | Symbol(s) (Overlap) | Stockdex Rank & Score | Yahoo Rank & Score | Notes
Basic Materials | LZM, ODV, MAGN | LZM #1 (1.0202), ODV #2 (0.8611), MAGN #4 (0.7287) | LZM #2 (1.0986), ODV #3 (0.9233), MAGN #5 (0.8681) | Close agreement; Yahoo scores higher overall
Communication Services | KUKE, FUBO | KUKE #1 (0.9641), FUBO #4 (0.5936) | KUKE #5 (0.5559), FUBO #4 (0.6544) | Generally consistent rank order
Consumer Cyclical | NIO, UXIN, LOGC | UXIN #1 (0.9612), LOGC #2 (0.8465), NIO #3 (0.8116) | NIO #5 (0.9276); SES #2 (1.2809) not in Stockdex | Partial overlap; Yahoo picks also include SES, MOGU
Consumer Defensive | YQ, OTLY, LOCL | LOCL #1 (0.9359), OTLY #2 (0.8963), YQ #4 (0.6482) | YQ #2 (1.2044), FEDU #1 (1.2189); LOCL missing | Some overlap, differences in top picks
Energy | DWSN, PBF | DWSN #2 (0.6201), PBF #3 (0.4305) | DWSN #1 (0.7613), PBF #3 (0.3556) | Rankings closely aligned
Financial Services | ARBK, MBI, KRNY | ARBK #1 (1.0185), MBI #2 (0.8956), KRNY #3 (0.7703) | ARBK #1 (1.0494), GREE #2 (0.9502); MBI, KRNY missing | Partial overlap
Healthcare | CGC, CCM, TRIB | CGC #1 (1.0091), CCM #2 (0.9822), TRIB #3 (0.9636) | RPTX #1 (1.3350), IFRX #2 (1.2594); CGC etc. missing | Mostly different picks
Industrials | FWRD, EVTL | FWRD #1 (0.9572), EVTL #4 (0.7314) | NPWR #1 (1.2057), EVTL #5 (0.9657) | Some overlap
Real Estate | OPAD, AIV | AIV #1 (0.7303), OPAD #2 (0.7286) | DOUG #1 (0.8116), OPAD #5 (0.5578) | Partial overlap
Technology | RUN, WBX, ZEPP | RUN #1 (1.1565), WBX #2 (0.8604), ZEPP #3 (0.8561) | WBX #2 (1.1401), RUN #3 (1.0593); ZEPP missing | Strong agreement
Utilities | AQN | AQN #1 (0.7382) | OKLO #1 (0.8269); no AQN | Different picks

So - this is excellent model validation, I think. We see some differences due to the amount of time-period data we have, but the results are not wildly different. 

I think I can now use this data in LSTM perhaps. Or whatever my next steps turn out to be, because I may - before LSTM - try to do some earnings transcript parsing for these if it's possible.


AI / ML - Modeling Fundamentals - Mistakes Found and Corrected

After adding Earnings Surprise Score data into my 1 year fwd return predicting model, I kind of felt as though I had hit the end of the road with the model. The Earnings Surprise Score did move the needle. But with all of the effort in Feature Engineering I had put into this model, the only thing I really felt I could add to it was sentiment (news). Given that news is more of a real-time concept, grows stale, and would be relevant for only the latest row of data, I decided to do some final reviews and move on - "graduate" to some new things, like maybe trying out a neural network or doing more current or real-time analysis. In fact, I had already tried a Quarterly model, but the R-squared on it was terrible and I decided not to use it - not even to ensemble it with the annual report data model.

So - I asked a few different LLMs to code review my model. And I was horrified to learn that because of using LLMs to continually tweak my model, I had wound up with issues related to "Snippet Integration". 

Specifically, I had some major issues:

1. Train/Test Split Happened Too Early or Multiple Times

  •  Splitting data before full preprocessing (e.g., before feature scaling, imputation, or log-transforming the target).
  •  Redundant train/test splits defined in multiple places — some commented, some active — leading to potential inconsistencies depending on which was used.


2. No Validation Set

  •  Originally, data was split into training and test sets.
    •  This meant that model tuning (e.g. SHAP threshold, hyperparameter selection) was inadvertently leaking test set information. 

  •  Now corrected with a clean train/val/test split.
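The clean three-way split is just two chained train_test_split calls. A sketch with toy arrays and integer split sizes for determinism (in practice you'd use fractions, e.g. a 70/15/15 split):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 2 features
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# First carve off a held-out test set...
X_trval, X_test, y_trval, y_test = train_test_split(
    X, y, test_size=15, random_state=42
)
# ...then split the remainder into train and validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_trval, y_trval, test_size=15, random_state=42
)
print(len(X_train), len(X_val), len(X_test))  # 70 15 15
```

All tuning decisions (SHAP thresholds, hyperparameters) are made against the validation set; the test set is touched exactly once at the end.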


3. Inconsistent Preprocessing Between Train and Test

  •  Preprocessing steps like imputation, outlier clipping, or scaling were not always applied after the split.
  •  This risked information bleeding from test → train, violating standard ML practice.
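The fix is to fit every preprocessing statistic on the training split only, then transform the other splits with those same fitted objects. A minimal sklearn sketch (toy arrays, not the model's actual pipeline):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0]])
X_test = np.array([[4.0, np.nan]])

# Fit statistics (medians, means, stds) on TRAIN only...
imputer = SimpleImputer(strategy='median').fit(X_train)
scaler = StandardScaler().fit(imputer.transform(X_train))

# ...then apply (transform) to both splits with the same fitted objects.
X_train_p = scaler.transform(imputer.transform(X_train))
X_test_p = scaler.transform(imputer.transform(X_test))
print(X_test_p)
```

The test row's NaN gets the train median (not a statistic computed over test data), so nothing bleeds from test into the fitted preprocessing.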


4. Improper Handling of Invalid Target Values (fwdreturn)

  •  NaN, inf, and unrealistic values (like ≤ -100%) were not being consistently filtered.
  •  This led to silent corruption of both training and evaluation scores.
  •  Now fixed with a strict invalid_mask and logging/reporting of dropped rows.
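A sketch of what such a strict invalid_mask can look like (toy data; the -100% floor means any fwdreturn at or below -1.0 is treated as invalid):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'fwdreturn': [0.12, np.nan, np.inf, -1.5, -0.30]})

# Strict mask: NaN, +/-inf, or a loss of 100% or worse are all invalid.
invalid_mask = (
    df['fwdreturn'].isna()
    | np.isinf(df['fwdreturn'])
    | (df['fwdreturn'] <= -1.0)
)
print(f"Dropping {invalid_mask.sum()} of {len(df)} rows with invalid fwdreturn")
df_clean = df[~invalid_mask]
```

Logging the drop count is the important part - silent row loss is exactly how the corruption went unnoticed before.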


5. Redundant or Conflicting Feature Definitions

  •  There were multiple, overlapping blocks like:

                features = [col for col in df.columns if ...]
                X = df[features]

  •  This made it unclear which feature list was actually being used.
  •  Sector z-scored and raw versions were sometimes duplicated or mixed without clarity.


6. Scaling and Z-Scoring Logic Was Not Modular or Controlled

  •  Originally, some features were being z-scored after asset-scaling (which didn’t make sense).
  •  Some metrics were scaled both to assets and z-scored sector-wise, which polluted the modeling signal.
  •  Now addressed with clear separation and feature naming conventions.


7. SHAP Was Applied to a Noisy or Unclean Feature Space

  •  Without proper pruning first (e.g., dropping all-NaN columns), SHAP feature importance included irrelevant or broken columns.
    •  This could inflate feature count or misguide model interpretation.
  •  Now resolved by cleaning feature set before SHAP and applying SHAP-based selection only on valid, imputed data.
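Once you have a mean |SHAP| value per feature (the kind of numbers shown in the overview above), the selection itself is just a threshold filter. A sketch with hypothetical values and an assumed cutoff - in the real pipeline the values come from shap.TreeExplainer on the trained model:

```python
import numpy as np

feature_names = ['dilutedEPS', 'roe', 'quick_ratio', 'current_ratio_sector_z']
# Hypothetical mean(|SHAP|) per feature; in the pipeline these come from
# averaging shap.TreeExplainer(model).shap_values(X_val) over rows.
mean_abs_shap = np.array([0.02566, 0.01071, 0.00000, 0.00000])

SHAP_THRESHOLD = 0.007  # assumed cutoff, not the model's actual setting
keep = mean_abs_shap >= SHAP_THRESHOLD
final_features = [f for f, k in zip(feature_names, keep) if k]
print(final_features)  # ['dilutedEPS', 'roe']
```

The cleanup matters because an all-NaN or broken column can still draw nonzero attributions, polluting exactly this ranking.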
One other issue was that the modeling code was in the main() function, which had gotten way too long and hard to read. This function had all of the training, testing, splitting/pruning, the model fitting, AND all of the scoring. I broke out the train/validate/test process and put it into play for both the full model and the SHAP-pruned model. Then I took the best R-squared from both runs and used the winning model.
