Friday, August 1, 2025

AI / ML - Data Source Comparison with More Data

"To validate generalizability, I ran the same model architecture against two datasets: a limited Yahoo-based dataset and a deeper Stockdex/Macrotrends dataset. Not only did the model trained on Stockdex data achieve a higher R² score (benefiting from more years of data), but the feature importances and pillar scores remained largely stable between the two. This consistency reinforces the robustness of the feature selection process and confirms that the model is learning persistent financial patterns, not artifacts of a specific data source or time window."

Details:

I wanted to move to an LSTM model to predict stock returns, but I was fortunate (and patient) enough to really read up and plan this before just diving into the deep end of the pool.

I learned that the "true AI" models - Transformers and RNNs (of which LSTM is a subtype) - require a lot more data. I didn't have anywhere near enough data using Yahoo, which gives 4-5 years of history at best. And because I was calculating momentum, YoY growth and the like, I would always lose one of those years (rows) right off the bat - a significant percentage of already-scarce data.

So, in digging around, I found a Python library called stockdex. It is architected to pull from multiple data sources, but the default is Macrotrends.

But using this library and source left several challenges:

  1. No quarterly data in the Python API - although the website does have a "Format" drop-down for Quarterly and Annual.
  2. The data was pivoted relative to Yahoo's. Yahoo puts line items as columns (x) and time periods as rows (y); the stockdex API returned the opposite.
  3. stockdex had no "names" for the line items.

Ultimately, I decided to use this because it returned a LOT more years of data.

  1. First, I put some code together to download the raw data (statements) and then "pivot" it to match Yahoo's format (see the sketch after this list).
  2. Then, I used a mapping approach to rename the columns from Macrotrends names to Yahoo names, so that I didn't have to change the logic that parses statements.
  3. I did have to do some run-and-tweak passes on the Metrics and Ratios, and fix certain columns that were not coming in correctly.
  4. Lastly, I ran the model - the same one as for Yahoo - and was able to keep the model logic essentially unchanged.
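
As an illustration, here is a minimal sketch of the pivot-and-rename step, assuming a DataFrame shaped the way stockdex/Macrotrends returns it (line items as rows, periods as columns). The mapping entries shown are just examples, not the full map I use.

    import pandas as pd

    # Example Macrotrends-style frame: line items as rows, fiscal periods as columns.
    raw = pd.DataFrame(
        {"2021-12-31": [365_817, 94_680], "2022-12-31": [394_328, 99_803]},
        index=["revenue", "net-income"],
    )

    # Partial mapping from Macrotrends item names to the Yahoo-style names the
    # downstream statement-parsing logic expects (illustrative subset only).
    MACROTRENDS_TO_YAHOO = {
        "revenue": "Total Revenue",
        "net-income": "Net Income",
    }

    def to_yahoo_layout(df: pd.DataFrame) -> pd.DataFrame:
        """Transpose to Yahoo's layout (periods as rows, items as columns) and rename items."""
        out = df.T                             # periods become the row index
        out.index = pd.to_datetime(out.index)  # keep periods as real dates
        return out.rename(columns=MACROTRENDS_TO_YAHOO).sort_index()

    statements = to_yahoo_layout(raw)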

The training took a LOT longer on Stockdex. The combined train+val set had 14,436 rows in it.
Here is what we got:
FULL R² -- Train: 0.8277, Validation: 0.3441, Test: 0.3537
PRUNED R² -- Train: 0.7714, Validation: 0.3146, Test: 0.3429
Selected FULL model based on test R².
Final Model Test Metrics -- R²: 0.3537, RMSE: 0.3315, MAE: 0.2282
Feature importance summary:
  → Total features evaluated: 79
  → Non-zero importance features: 75

Running and scoring the model took a very, very long time. Finally it came out with this Top 25 list.
Top 25 Stocks by Final adj_pillar_score:
     symbol  adj_pillar_score  improvement_bonus  pillar_score
1209    RUN          1.156474               0.05      1.106474
884     LZM          1.020226               0.05      0.970226
97     ARBK          1.018518               0.00      1.018518
277     CGC          1.009068               0.02      0.989068
262     CCM          0.982228               0.02      0.962228
821    KUKE          0.964131               0.00      0.964131
1415   TRIB          0.963591               0.02      0.943591
1473   UXIN          0.961206               0.05      0.911206
571    FWRD          0.957156               0.05      0.907156
859    LOCL          0.935929               0.00      0.935929
1069   OTLY          0.896289               0.00      0.896289
894     MBI          0.895565               0.05      0.845565
1159   QDEL          0.890248               0.05      0.840248
1039    ODV          0.861127               0.00      0.861127
1522    WBX          0.860391               0.00      0.860391
1578   ZEPP          0.856097               0.02      0.836097
860    LOGC          0.846546               0.05      0.796546
990     NIO          0.811563               0.02      0.791563
1428    TSE          0.775067               0.05      0.725067
930    MODV          0.773322               0.05      0.723322
817    KRNY          0.770282               0.05      0.720282
1545    WNC          0.767113               0.02      0.747113
65     ALUR          0.756362               0.00      0.756362
813    KPTI          0.749644               0.05      0.699644
1316   SRFM          0.743651               0.00      0.743651

Then I ran the smaller Yahoo model:
Training pruned model...
FULL R² -- Train: 0.8181, Validation: 0.3613, Test: 0.2503
PRUNED R² -- Train: 0.8310, Validation: 0.3765, Test: 0.2693
Selected PRUNED model based on test R².
Final Model Test Metrics -- R²: 0.2693, RMSE: 0.4091, MAE: 0.2606
Feature importance summary:
  → Total features evaluated: 30
  → Non-zero importance features: 30

And, the Top 25 report for that one looks like this:
Top 25 Stocks by Final adj_pillar_score:
     symbol  adj_pillar_score  improvement_bonus  pillar_score
907    MOGU          1.532545               0.05      1.482545
1233    SKE          1.345178               0.05      1.295178
1170   RPTX          1.334966               0.02      1.314966
419      DQ          1.305644               0.05      1.255644
1211    SES          1.280886               0.05      1.230886
702    IFRX          1.259426               0.00      1.259426
908    MOLN          1.244191               0.02      1.224191
1161   RLYB          1.237648               0.00      1.237648
176    BHVN          1.232199               0.05      1.182199
512    FEDU          1.218868               0.05      1.168868
977    NPWR          1.205679               0.00      1.205679
1533     YQ          1.204367               0.02      1.184367
11     ABUS          1.201539               0.00      1.201539
58     ALLK          1.192839               0.02      1.172839
1249    SMR          1.154462               0.00      1.154462
63     ALXO          1.148672               0.02      1.128672
1482    WBX          1.140147               0.00      1.140147
987    NUVB          1.139138               0.00      1.139138
1128     QS          1.130001               0.02      1.110001
864     LZM          1.098632               0.00      1.098632
16     ACHR          1.094872               0.02      1.074872
1176    RUN          1.059293               0.02      1.039293
758    JMIA          1.053711               0.00      1.053711
94     ARBK          1.049382               0.00      1.049382
1086   PHVS          1.039269               0.05      0.989269

Symbols that appear in both top 25 lists:

  • RUN (Stockdex rank 1, Yahoo rank 23)

  • LZM (Stockdex rank 2, Yahoo rank 21)

  • ARBK (Stockdex rank 3, Yahoo rank 24)

  • WBX (Stockdex rank 16, Yahoo rank 17)

interesting...

Comparing the top sector reports:

Side-by-side Overlap Analysis Approach

Sector | Symbol(s) (Overlap) | Stockdex Rank & Score | Yahoo Rank & Score | Notes
Basic Materials | LZM, ODV, MAGN | LZM #1 (1.0202), ODV #2 (0.8611), MAGN #4 (0.7287) | LZM #2 (1.0986), ODV #3 (0.9233), MAGN #5 (0.8681) | Close agreement; Yahoo scores higher overall
Communication Services | KUKE, FUBO | KUKE #1 (0.9641), FUBO #4 (0.5936) | KUKE #5 (0.5559), FUBO #4 (0.6544) | Generally consistent rank order
Consumer Cyclical | NIO, UXIN, LOGC | UXIN #1 (0.9612), LOGC #2 (0.8465), NIO #3 (0.8116) | NIO #5 (0.9276); SES #2 (1.2809) not in Stockdex | Partial overlap; Yahoo picks also include SES, MOGU
Consumer Defensive | YQ, OTLY, LOCL | LOCL #1 (0.9359), OTLY #2 (0.8963), YQ #4 (0.6482) | YQ #2 (1.2044), FEDU #1 (1.2189); LOCL missing | Some overlap; differences in top picks
Energy | DWSN, PBF | DWSN #2 (0.6201), PBF #3 (0.4305) | DWSN #1 (0.7613), PBF #3 (0.3556) | Rankings closely aligned
Financial Services | ARBK, MBI, KRNY | ARBK #1 (1.0185), MBI #2 (0.8956), KRNY #3 (0.7703) | ARBK #1 (1.0494), GREE #2 (0.9502); missing MBI, KRNY | Partial overlap
Healthcare | CGC, CCM, TRIB | CGC #1 (1.0091), CCM #2 (0.9822), TRIB #3 (0.9636) | RPTX #1 (1.3350), IFRX #2 (1.2594); missing CGC, etc. | Mostly different picks
Industrials | FWRD, EVTL | FWRD #1 (0.9572), EVTL #4 (0.7314) | NPWR #1 (1.2057), EVTL #5 (0.9657) | Some overlap
Real Estate | OPAD, AIV | AIV #1 (0.7303), OPAD #2 (0.7286) | DOUG #1 (0.8116), OPAD #5 (0.5578) | Partial overlap
Technology | RUN, WBX, ZEPP | RUN #1 (1.1565), WBX #2 (0.8604), ZEPP #3 (0.8561) | WBX #2 (1.1401), RUN #3 (1.0593); ZEPP missing | Strong agreement
Utilities | AQN | AQN #1 (0.7382) | OKLO #1 (0.8269); no AQN | Different picks

So - this is excellent model validation, I think. We see some differences due to the amount of time-period data we have, but the results are not wildly different.

I think I can now use this data in an LSTM, perhaps. Or whatever my next steps turn out to be - before LSTM, I may try to do some earnings-transcript parsing for these, if that's possible.


AI / ML - Modeling Fundamentals - Mistakes Found and Corrected

After adding Earnings Surprise Score data into my 1-year forward return prediction model, I kind of felt as though I had hit the end of the road with the model. The Earnings Surprise Score did move the needle. But after all of the feature-engineering effort I had put into this model, the only thing I really felt I could still add was sentiment (news). Given that news is more of a real-time concept, grows stale, and would be relevant for only the latest row of data, I decided to do some final reviews and move on, or "graduate" to some new things - like maybe trying out a neural network or doing more current or real-time analysis. In fact, I had already tried a quarterly model, but the R-squared on it was terrible and I decided not to use it - not even to ensemble it with the annual-report-data model.

So - I asked a few different LLMs to code-review my model. And I was horrified to learn that, because I had been using LLMs to continually tweak my model, I had wound up with issues related to "snippet integration".

Specifically, I had some major issues:

1. Train/Test Split Happened Too Early or Multiple Times

  •  Splitting data before full preprocessing (e.g., before feature scaling, imputation, or log-transforming the target).
  •  Redundant train/test splits defined in multiple places — some commented, some active — leading to potential inconsistencies depending on which was used.


2. No Validation Set

  •  Originally, data was split into training and test sets.
    •  This meant that model tuning (e.g. SHAP threshold, hyperparameter selection) was inadvertently leaking test set information. 

  •  Now corrected with a clean train/val/test split.


3. Inconsistent Preprocessing Between Train and Test

  •  Preprocessing steps like imputation, outlier clipping, or scaling were not always applied after the split.
  •  This risked information leaking from test → train, violating standard ML practice (a sketch of the corrected split-then-preprocess flow follows this list).
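
Here is a minimal sketch of the corrected flow - split first, then fit the imputer/scaler on the training set only and apply the fitted transforms to validation and test. The 60/20/20 split and the variable names are illustrative, not my exact code; X and y are assumed to already hold the features and target.

    from sklearn.model_selection import train_test_split
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler

    # Split once, before fitting any preprocessing step (60/20/20 here).
    X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=42)
    X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)

    # Fit preprocessing on train only; transform val/test with the fitted objects.
    imputer = SimpleImputer(strategy="median")
    scaler = StandardScaler()

    X_train_p = scaler.fit_transform(imputer.fit_transform(X_train))
    X_val_p = scaler.transform(imputer.transform(X_val))
    X_test_p = scaler.transform(imputer.transform(X_test))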


4. Improper Handling of Invalid Target Values (fwdreturn)

  •  NaN, inf, and unrealistic values (like ≤ -100%) were not being consistently filtered.
  •  This led to silent corruption of both training and evaluation scores.
  •  Now fixed with a strict invalid_mask and logging/reporting of dropped rows (see the sketch below).
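
A minimal sketch of that filter, assuming the target lives in df["fwdreturn"] and that "≤ -100%" means a value at or below -1.0 (the exact scale is my assumption):

    import numpy as np

    # Flag NaN, ±inf, and implausible targets (forward return at or below -100%).
    invalid_mask = (
        df["fwdreturn"].isna()
        | ~np.isfinite(df["fwdreturn"])
        | (df["fwdreturn"] <= -1.0)
    )

    print(f"Dropping {invalid_mask.sum()} of {len(df)} rows with invalid fwdreturn")
    df = df.loc[~invalid_mask].copy()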


5. Redundant or Conflicting Feature Definitions

  •  There were multiple, overlapping blocks like:

                features = [col for col in df.columns if ...]
                X = df[features]

  •  This made it unclear which feature list was actually being used.
  •  Sector z-scored and raw versions were sometimes duplicated or mixed without clarity.


6. Scaling and Z-Scoring Logic Was Not Modular or Controlled

  •  Originally, some features were being z-scored after asset-scaling (which didn’t make sense).
  •  Some metrics were scaled both to assets and z-scored sector-wise, which polluted the modeling signal.
  •  Now addressed with clear separation and feature naming conventions.


7. SHAP Was Applied to a Noisy or Unclean Feature Space

  •  Without proper pruning first (e.g., dropping all-NaN columns), SHAP feature importance included irrelevant or broken columns.
    •  This could inflate feature count or misguide model interpretation.
  •  Now resolved by cleaning the feature set before SHAP and applying SHAP-based selection only on valid, imputed data (see the sketch below).
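
A minimal sketch of that ordering - clean first, then fit, then SHAP-prune. The model hyperparameters and the importance threshold here are illustrative placeholders, not the values the model actually uses:

    import numpy as np
    import shap
    import xgboost as xgb

    # 1) Clean the feature space first: drop all-NaN columns, then impute.
    X_clean = X_train.dropna(axis=1, how="all")
    X_clean = X_clean.fillna(X_clean.median())
    feature_names = X_clean.columns.tolist()

    # 2) Fit on the cleaned, imputed training data only.
    model = xgb.XGBRegressor(n_estimators=400, max_depth=5, learning_rate=0.05)
    model.fit(X_clean, y_train)

    # 3) SHAP-based selection: keep features whose mean |SHAP| clears a threshold.
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X_clean)
    mean_abs_shap = np.abs(shap_values).mean(axis=0)

    threshold = mean_abs_shap.mean() * 0.1  # illustrative cutoff
    pruned_features = [f for f, v in zip(feature_names, mean_abs_shap) if v >= threshold]
    print(f"Selected {len(pruned_features)} of {len(feature_names)} features")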
One other issue was that the modeling code lived in the main() function, which had gotten way too long and hard to read. That function contained the training, testing, splitting/pruning, the model fitting, AND all of the scoring. I broke the train/validate/test process out into its own routine and ran it for both the full model and the SHAP-pruned model, then took the best R-squared from the two runs and used the winning model (sketched below).
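
Roughly, the refactor looks like this - one helper that trains and scores a feature set, called once for the full set and once for the SHAP-pruned set, with the winner chosen the way the logs above describe. Names like all_features and pruned_features are assumed from earlier steps:

    from sklearn.metrics import r2_score
    import xgboost as xgb

    def train_validate_test(features, X_train, y_train, X_val, y_val, X_test, y_test):
        """Fit on train, then report validation and test R² for one feature set."""
        model = xgb.XGBRegressor(n_estimators=400, max_depth=5, learning_rate=0.05)
        model.fit(X_train[features], y_train)
        return {
            "model": model,
            "val_r2": r2_score(y_val, model.predict(X_val[features])),
            "test_r2": r2_score(y_test, model.predict(X_test[features])),
        }

    full = train_validate_test(all_features, X_train, y_train, X_val, y_val, X_test, y_test)
    pruned = train_validate_test(pruned_features, X_train, y_train, X_val, y_val, X_test, y_test)

    # Keep whichever run scored better (the runs above selected on test R²).
    winner = full if full["test_r2"] >= pruned["test_r2"] else pruned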

Tuesday, July 29, 2025

AI / ML - Feature Engineering - Earnings Surprises

I really felt that news would be a great thing to add to the model. But the problem with news is that it is recent, while the data I am using with XGBoost is historical time-series data.

If you added news, what would you do - cram the values only into the most recent year?

I think if you go with data that is changing and close to real-time, you need to rethink the whole model, including the type of model. Maybe news works better with a Transformer or an LSTM neural network than with a predictive regression model.

So - I am running out of new things to add to my model to try and boost the predictability of it (increase the R-squared).

Then I came up with the idea of adding earnings hits, misses, and meets. A quick consult with an LLM suggested using an earnings_surprise score, so that we not only get the miss/meet/beat counts but also capture the magnitude. A great idea.

I implemented this, and lo and behold, the earnings_surprise score moves the needle. Substantially and consistently.

The best thing about this is that the earnings_surprise score is symbol-specific, so it is not some macro feature I have to figure out how to interact with the symbol data.
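
For reference, here is a minimal sketch of one way such a score could be computed - an average signed surprise per symbol over its reported quarters. The column names and the exact definition are my assumptions, not necessarily how the model defines it:

    import numpy as np
    import pandas as pd

    def earnings_surprise_score(eps: pd.DataFrame) -> pd.Series:
        """Average signed earnings surprise per symbol.

        Expects columns: symbol, eps_actual, eps_estimate (hypothetical names).
        Positive values mean a tendency to beat estimates, negative a tendency
        to miss, and the size of each beat/miss is preserved rather than just counted.
        """
        surprise = (eps["eps_actual"] - eps["eps_estimate"]) / eps["eps_estimate"].abs().replace(0, np.nan)
        return surprise.groupby(eps["symbol"]).mean().rename("earnings_surprise_score")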

Wednesday, July 9, 2025

AI / ML - Feature Engineering - Interaction Features

I added some new macro features to my model - credit card debt, credit card delinquency, and unemployment data.

Some of these were VERY influential features.

So we can see that unemployment_rate is an important feature! It tops the list!!!

But - since we are doing relative scoring on stocks, what good does that do us, if every single stock sees the same macro values???

The answer: Interaction Features. 

Since unemployment can impact revenue growth (fewer consumers can afford to buy), you multiply the Revenue Growth Year-Over-Year percentage by the unemployment rate. Now you get a UNIQUE value for that specific stock symbol, instead of just throwing "across the board" metrics at every stock.

Now, if you don't do this, the macro variables in and of themselves CAN still influence the model, especially if a stock's forward return is sensitive to that feature - that is what XGBoost gives you. But you help the correlation by giving every stock a uniquely calculated impact, as opposed to giving every stock the same value of "X.Y".
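
In code, the interaction is just a per-row product - something like this sketch (column names other than unemployment_x_rev_grw are hypothetical):

    # Turn a shared macro value into a symbol-specific signal by interacting it
    # with a per-stock fundamental. Every row now carries its own unique value.
    df["unemployment_x_rev_grw"] = df["rev_growth_yoy"] * df["unemployment_rate"]
    df["cc_delinq_x_debt_to_equity"] = df["cc_delinquency_rate"] * df["debt_to_equity"]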

I did this, and got my latest high score on R-Squared.
Selected 30 features out of 97 (threshold = 0.007004095707088709)
⭐ New best model saved with R²: 0.4001

Pruned XGBoost Model R² score: 0.4001
Pruned XGBoost Model RMSE: 0.3865
Pruned XGBoost Model MAE: 0.2627

Full XGBoost R² score: 0.3831
Full XGBoost RMSE 0.3919
Full XGBoost MAE: 0.2694

Update: It should be noted that both the raw macro metrics and the interaction features are thrown into the model - and we let the process (i.e. SHAP) decide which ones, the raw or the interaction feature, to use in the subsequent pruned model.

In the screenshot above, BOTH of them appear at the top. Unemployment is #1, far and away. And the unemployment_x_rev_grw metric, an interaction feature that combines stock-specific data with the macro data, is also ranked high, but below the raw metric.

I get differing opinions on whether it is good practice to leave them both in the model (raw macro features AND derived interaction features), or to settle on one or the other. It is awfully hard to prune off your top prediction influencer and not use it.

If you have any comments or thoughts on this, please feel free to comment. 


 

Saturday, July 5, 2025

AI / ML - Here is Why You Backtest

My model was working nicely. 

It scored stocks on a number of fronts (pillars).

It used Profitability. It used Solvency. It used Liquidity. It used Efficiency.

These are the "four horsemen" of stock evaluation.

I added some of my own twists to the "grading formula", in the form of macro variables (consumer sentiment, business confidence, et al). I also had some trend analysis, rewarding trends up and penalizing trends down. I rewarded (and penalized) profitability, cash flow, etc. I had scaling done correctly too, in order to ensure a "fair playing field", along with some sector normalization.

When I ran the model, using XGBoost to predict 1-year forward return, the stocks at the top of the report looked great when I spot-checked them against various sites that also grade stocks. I felt good. The R-squared I was getting from XGBoost with a SHAP-pruned feature run was at academic levels (as high as 0.46 at one point).

As part of some final QA, I ran the resulting code through AI engines, which praised its thoroughness and slapped me on the back, reassuring me that my model was on a par with, if not superior to, many academic models.

Then - someone asked me whether this model had been back-tested.
And the answer was no. I had not back-tested it up to that point. I didn't think I was ready for back-testing.

Maybe back-testing is an iterative "continual improvement" process that should be done much earlier in the process, to ensure you don't go down the wrong road.  But I didn't do that.

So, I ran a back-test. And to my horror, the model was completely "upside down" in terms of stocks that would predict forward return.  The AI engines suggested I simply "flip the sign" on my score and invert them. But that didn't feel right. It felt like I was trying to force a score.  

So - the first thing we did was evaluate the scoring. We looked at the correlation between individual scoring pillars and forward return. Negative.

We then looked at correlation in more detail.

First, we calculated Pearson (row-level) and Spearman (rank-level) correlations.

They were negative.

Then, we calculated Average Fwd Return by Score Decile. Sure enough, there was a trend, but completely backwards from what one would expect. 

Quality stocks in the top score deciles (9, 8, 7, 6, 5) had negative average returns that improved as the decile dropped, while the shaky stocks (deciles 0-4) had progressively more positive values.
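
The checks themselves are short - roughly like this sketch, assuming the composite score and target live in columns named pillar_score and fwdreturn:

    import pandas as pd

    # Row-level (Pearson) and rank-level (Spearman) correlation of score vs. forward return.
    pearson = df["pillar_score"].corr(df["fwdreturn"], method="pearson")
    spearman = df["pillar_score"].corr(df["fwdreturn"], method="spearman")
    print(f"Pearson: {pearson:.4f}  Spearman: {spearman:.4f}")

    # Average forward return by score decile (decile 9 = highest scores).
    df["score_decile"] = pd.qcut(df["pillar_score"], 10, labels=False, duplicates="drop")
    print(df.groupby("score_decile")["fwdreturn"].mean())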

The interesting analysis was a dump of the correlations of each individual pillar against forward return. The strongest were Profitability and Valuation, followed by MacroBehavior (the macroeconomic features), but none of these were strong. The correlations were slightly negative, with a couple slightly above zero.

But one was VERY interesting: a log1p correlation between the "final composite score" and forward return that was noticeable, if not sizable - but negative.

We experimented with commenting out the penalties so we could focus on "true metrics" (a flag had been engineered in to turn these off, which made it easy to test). We re-ran the model, and STILL the correlations with forward return were negative.

Then - we decided to remove individual pillars. Didn't change a thing. STILL the correlations with forward return were negative.

Finally, after the AI assured me - after reviewing the code - that there were no scoring errors, the only thing left to try, aside from shelving the model for its failure to predict forward return, was in fact to put a negative sign on the score to invert it and "flip the score".

I did this. And, while the companies that bubbled to the top were shaky on their fundamentals, I did see cases where analyst price targets on these stocks were above (and in some cases way above) the current stock price.

So here is the evidence that we have a model that IS predicting forward return, in a real way.

So - in conclusion. Quality does NOT necessarily equate to forward return.

What does??? Well, nothing in those pillars individually. But when you combine all of these metrics/features into a big pot and send them to a sophisticated regression modeler, it does find a magical combination with a roughly linear relationship to forward return - and depending on which way you flip that line, you can theoretically gain, or lose, a return on your money.

Now, if we had put money into those "great stocks" at the top of that prior list, and then had to watch as we lost money, it would have been puzzling and frustrating. But - do we have the courage to put money into these less-than-stellar fundamental stocks to see if this model is right, and that we WILL get a positive forward return? 

I guess it takes some experimentation. Either a simulator, OR put $X into the top ten and another $X into the bottom ten and see how they perform. Which is what I might be doing shortly.


Wednesday, July 2, 2025

AI / ML - Altman Z-Score

I saw a website that was showing an Altman Z-Score for companies in their Solvency section.

Not fully aware of this calculation, I decided to jump in and add it to my model.

I quickly backed it out.

Why?

The Altman Z-Score uses different formulas depending on the industry a company is in.

Manufacturers use one formula, non-manufacturers use another, and banks and finance companies generally don't calculate it at all.
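
For reference, the two variants look roughly like this (the standard published coefficients, shown as a sketch rather than my model code) - which is exactly why a single Z column mixes apples and oranges:

    def altman_z_manufacturer(wc, re, ebit, mve, sales, ta, tl):
        """Original Z-Score for public manufacturers (wc = working capital, re = retained
        earnings, mve = market value of equity, ta/tl = total assets/liabilities)."""
        return 1.2 * wc / ta + 1.4 * re / ta + 3.3 * ebit / ta + 0.6 * mve / tl + 1.0 * sales / ta

    def altman_z_double_prime(wc, re, ebit, book_equity, ta, tl):
        """Z''-Score variant commonly applied to non-manufacturers."""
        return 6.56 * wc / ta + 3.26 * re / ta + 6.72 * ebit / ta + 1.05 * book_equity / tl

    # Banks and other financial companies are typically excluded entirely.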

So imagine calculating this and feeding it into XGBoost / SHAP to predict price or return on a security. 

First of all, because you have so many NaN values (companies where it simply doesn't exist), you have a missingness issue. Then, the values differ due to the different formulas. And if you don't cap the score, you can get outliers that wreak havoc.

So in the end, it's fine to calculate it, but if you calculate it, don't model it as a predictive feature. 

Just calculate it and "tack it on" (staple it) to any sector-specific scores you are generating, for purposes like rank-within-sector.

AI / ML - Feature Explosion and Normalization

I got into some trouble with my model when it blew up to 150-180 features. When I finally took the time to really scrutinize things, I noticed that I had duplicates of raw metrics alongside their sector-z-scored counterparts. I don't know how that happened, or how those sector-z-scored columns got into the feature set submitted to XGBoost and SHAP - probably logic inserted in the wrong place.

I wound up removing all of the sector-z scored metrics for now. 

But this highlighted a problem. Semantics.

I had some metrics that needed to be normalized for scoring and comparison purposes - mostly raw metrics - and to do this, we divided the value by TotalAssets. For metrics that we did NOT want to scale this way, we had some exclusion logic based on regular expressions (regex): we looked for metric names containing "Per" and "To" (among others).
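
A minimal sketch of that exclusion-plus-scaling logic, assuming camel-case column names; the lookahead for an uppercase letter (and the exact list of markers) is my assumption about how the matching works:

    import re

    # Names like "DebtToEquity" or "RevenuePerShare" are already ratios and are
    # excluded from asset-scaling; everything else is divided by TotalAssets.
    RATIO_PATTERN = re.compile(r"(Per|To)(?=[A-Z])")

    for col in numeric_cols:  # numeric_cols: assumed list of raw metric columns
        if col == "TotalAssets" or RATIO_PATTERN.search(col):
            continue
        df[f"{col}_ToAssets"] = df[col] / df["TotalAssets"]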

This seems to have fixed our feature set, and it is so much better to see 30 of 80 features selected instead of 100 of 180. It removed a ton of noise from the model, improving its integrity.

Now I do need to go back and examine why we did the sector z-scores initially, to see if that is something we do need to engineer back in. I think we need to do that in the cases where we are producing a Top-X-By-Sector report. 
