Friday, August 1, 2025

AI / ML - Data Source Comparison with More Data

"To validate generalizability, I ran the same model architecture against two datasets: a limited Yahoo-based dataset and a deeper Stockdex/Macrotrends dataset. Not only did the model trained on Stockdex data achieve a higher R² score (benefiting from more years of data), but the feature importances and pillar scores remained largely stable between the two. This consistency reinforces the robustness of the feature selection process and confirms that the model is learning persistent financial patterns, not artifacts of a specific data source or time window."

Details:

I wanted to move to an LSTM model to predict stock returns, but I was fortunate (and patient) enough to really read up and plan this before just diving into the deep end of the pool.

I learned that the "true AI" models - Transformers, RNNs, and the like (LSTM being a type of RNN) - require much more data. I didn't have anywhere near enough using Yahoo, which gives 4-5 years of history at best. And because I was calculating momentum, YoY growth, and similar features, I would always lose one of those years (rows) right off the bat - a significant percentage of already-scarce data.
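
To make this concrete, here is a tiny pandas sketch (made-up numbers and column names) of why a YoY feature eats one row per ticker:

import pandas as pd

# Hypothetical annual fundamentals: one row per (symbol, fiscal year)
df = pd.DataFrame({
    "symbol":  ["AAPL"] * 4 + ["MSFT"] * 4,
    "year":    [2020, 2021, 2022, 2023] * 2,
    "revenue": [274.5, 365.8, 394.3, 383.3, 143.0, 168.1, 198.3, 211.9],
})

# YoY growth needs the prior year, so the first year of each symbol becomes NaN...
df["revenue_yoy"] = df.groupby("symbol")["revenue"].pct_change()

# ...and dropping those NaNs costs one row per ticker - 25% of a 4-year history
df = df.dropna(subset=["revenue_yoy"])
print(len(df))  # 6 rows left of the original 8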

So, in digging around, I found a Python library called stockdex. It is architected to be able to pull from multiple data sources, but the default is Macrotrends.
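
Pulling the Macrotrends-backed statements through stockdex looks roughly like this. This is a minimal sketch; the attribute names reflect my reading of the stockdex docs and may differ in the version you install:

from stockdex import Ticker

ticker = Ticker(ticker="AAPL")

# Macrotrends-backed annual statements, returned as pandas DataFrames.
# (Attribute names are my understanding of the stockdex interface -
# check the library's docs for the exact names in your version.)
income_stmt   = ticker.macrotrends_income_statement
balance_sheet = ticker.macrotrends_balance_sheet
cash_flow     = ticker.macrotrends_cash_flow

print(income_stmt.shape)  # many more annual periods than Yahoo's 4-5 years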

But using this library and source left several challenges:

  1. No quarterly data in the Python API - even though the Macrotrends website has a "Format" drop-down for Quarterly and Annual.
  2. The data came transposed relative to Yahoo's layout. Yahoo puts the line items as columns (x) and the time periods as rows (y); the stockdex API returned it the opposite way.
  3. The stockdex data had no friendly "names" for the items.

Ultimately, I decided to use it anyway because it returned a LOT more years of data. Here is how I worked through the issues:

  1. First, I put some code together to download the raw data (statements) and then "pivot" it to match Yahoo's format (a sketch of this follows the list).
  2. Then, I used a mapping approach to rename the columns from Macrotrends names to Yahoo names, so that I didn't have to change the logic that parsed the statements.
  3. I did have to do a few run-and-tweak passes on the Metrics and Ratios, and fix certain columns that were not coming in correctly.
  4. Lastly, I ran the model - the same one used for Yahoo - and was able to keep the model logic essentially unchanged.
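
Here is the sketch mentioned in step 1 - a rough illustration of the pivot and column-mapping idea in pandas (the item labels and mapping entries are invented for the example, not the real Macrotrends names):

import pandas as pd

# Raw Macrotrends-style frame: line items as rows, periods as columns
raw = pd.DataFrame(
    {"2021-12-31": [365.8, 94.7], "2022-12-31": [394.3, 99.8]},
    index=["revenue", "net-income"],
)

# Step 1: transpose so it matches Yahoo's layout (periods as rows, items as columns)
stmt = raw.T
stmt.index = pd.to_datetime(stmt.index)

# Step 2: map Macrotrends item names onto the Yahoo names that the existing
# statement-parsing logic already expects (mapping entries are illustrative)
MACROTRENDS_TO_YAHOO = {
    "revenue": "Total Revenue",
    "net-income": "Net Income",
}
stmt = stmt.rename(columns=MACROTRENDS_TO_YAHOO)

print(stmt)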

The training took a LOT longer on the Stockdex data. The combined train+val set had 14,436 rows.
Here is what we got:
FULL R² -- Train: 0.8277, Validation: 0.3441, Test: 0.3537
PRUNED R² -- Train: 0.7714, Validation: 0.3146, Test: 0.3429
Selected FULL model based on test R².
Final Model Test Metrics -- R²: 0.3537, RMSE: 0.3315, MAE: 0.2282
Feature importance summary:
  → Total features evaluated: 79
  → Non-zero importance features: 75
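
For reference, the metrics and the "pick full vs. pruned by test R²" step boil down to something like this (a scikit-learn sketch, not the actual training script):

import numpy as np
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

def test_metrics(y_true, y_pred):
    """The three numbers reported above, for one model's test predictions."""
    return {
        "r2":   r2_score(y_true, y_pred),
        "rmse": np.sqrt(mean_squared_error(y_true, y_pred)),
        "mae":  mean_absolute_error(y_true, y_pred),
    }

def select_model(y_test, full_pred, pruned_pred):
    """Selection rule from the log: keep whichever model has the higher test R²."""
    if r2_score(y_test, full_pred) >= r2_score(y_test, pruned_pred):
        return "FULL", test_metrics(y_test, full_pred)
    return "PRUNED", test_metrics(y_test, pruned_pred)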

Running and scoring the model took a very, very long time. Finally, it came out with this Top 25 list.
Top 25 Stocks by Final adj_pillar_score:
     symbol  adj_pillar_score  improvement_bonus  pillar_score
1209    RUN          1.156474               0.05      1.106474
884     LZM          1.020226               0.05      0.970226
97     ARBK          1.018518               0.00      1.018518
277     CGC          1.009068               0.02      0.989068
262     CCM          0.982228               0.02      0.962228
821    KUKE          0.964131               0.00      0.964131
1415   TRIB          0.963591               0.02      0.943591
1473   UXIN          0.961206               0.05      0.911206
571    FWRD          0.957156               0.05      0.907156
859    LOCL          0.935929               0.00      0.935929
1069   OTLY          0.896289               0.00      0.896289
894     MBI          0.895565               0.05      0.845565
1159   QDEL          0.890248               0.05      0.840248
1039    ODV          0.861127               0.00      0.861127
1522    WBX          0.860391               0.00      0.860391
1578   ZEPP          0.856097               0.02      0.836097
860    LOGC          0.846546               0.05      0.796546
990     NIO          0.811563               0.02      0.791563
1428    TSE          0.775067               0.05      0.725067
930    MODV          0.773322               0.05      0.723322
817    KRNY          0.770282               0.05      0.720282
1545    WNC          0.767113               0.02      0.747113
65     ALUR          0.756362               0.00      0.756362
813    KPTI          0.749644               0.05      0.699644
1316   SRFM          0.743651               0.00      0.743651
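
From the columns in that report, adj_pillar_score is just pillar_score plus improvement_bonus. Producing a Top 25 listing like the one above is roughly this (column names taken from the report; the bonus logic itself isn't shown):

import pandas as pd

def top_25(scores: pd.DataFrame) -> pd.DataFrame:
    """Rank scored symbols and keep the 25 best by final adjusted score."""
    out = scores.copy()
    # The report's columns imply: adj_pillar_score = pillar_score + improvement_bonus
    out["adj_pillar_score"] = out["pillar_score"] + out["improvement_bonus"]
    return out.nlargest(25, "adj_pillar_score")[
        ["symbol", "adj_pillar_score", "improvement_bonus", "pillar_score"]
    ]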

Then I ran the smaller Yahoo model:
Training pruned model...
FULL R² -- Train: 0.8181, Validation: 0.3613, Test: 0.2503
PRUNED R² -- Train: 0.8310, Validation: 0.3765, Test: 0.2693
Selected PRUNED model based on test R².
Final Model Test Metrics -- R²: 0.2693, RMSE: 0.4091, MAE: 0.2606
Feature importance summary:
  → Total features evaluated: 30
  → Non-zero importance features: 30

And, the Top 25 report for that one looks like this:
Top 25 Stocks by Final adj_pillar_score:
     symbol  adj_pillar_score  improvement_bonus  pillar_score
907    MOGU          1.532545               0.05      1.482545
1233    SKE          1.345178               0.05      1.295178
1170   RPTX          1.334966               0.02      1.314966
419      DQ          1.305644               0.05      1.255644
1211    SES          1.280886               0.05      1.230886
702    IFRX          1.259426               0.00      1.259426
908    MOLN          1.244191               0.02      1.224191
1161   RLYB          1.237648               0.00      1.237648
176    BHVN          1.232199               0.05      1.182199
512    FEDU          1.218868               0.05      1.168868
977    NPWR          1.205679               0.00      1.205679
1533     YQ          1.204367               0.02      1.184367
11     ABUS          1.201539               0.00      1.201539
58     ALLK          1.192839               0.02      1.172839
1249    SMR          1.154462               0.00      1.154462
63     ALXO          1.148672               0.02      1.128672
1482    WBX          1.140147               0.00      1.140147
987    NUVB          1.139138               0.00      1.139138
1128     QS          1.130001               0.02      1.110001
864     LZM          1.098632               0.00      1.098632
16     ACHR          1.094872               0.02      1.074872
1176    RUN          1.059293               0.02      1.039293
758    JMIA          1.053711               0.00      1.053711
94     ARBK          1.049382               0.00      1.049382
1086   PHVS          1.039269               0.05      0.989269

Symbols that appear in both top 25 lists:

  • RUN (Stockdex rank 1, Yahoo rank 23)

  • LZM (Stockdex rank 2, Yahoo rank 21)

  • ARBK (Stockdex rank 3, Yahoo rank 24)

  • WBX (Stockdex rank 16, Yahoo rank 17)

interesting...
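
For reference, finding those overlaps is just a merge of the two ranked lists - a quick sketch (not the exact code used), assuming each frame is already sorted by adj_pillar_score:

import pandas as pd

def top25_overlap(stockdex_top: pd.DataFrame, yahoo_top: pd.DataFrame) -> pd.DataFrame:
    """Symbols present in both Top 25 lists, with their rank in each."""
    sdx = stockdex_top.reset_index(drop=True)
    yho = yahoo_top.reset_index(drop=True)
    sdx["stockdex_rank"] = sdx.index + 1   # position in the sorted list = rank
    yho["yahoo_rank"]    = yho.index + 1
    both = sdx.merge(yho, on="symbol", suffixes=("_stockdex", "_yahoo"))
    return both[["symbol", "stockdex_rank", "yahoo_rank"]]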

Comparing the top sector reports:

Side-by-side Overlap Analysis Approach

Sector | Symbol(s) (Overlap) | Stockdex Rank & Score | Yahoo Rank & Score | Notes
Basic Materials | LZM, ODV, MAGN | LZM #1 (1.0202), ODV #2 (0.8611), MAGN #4 (0.7287) | LZM #2 (1.0986), ODV #3 (0.9233), MAGN #5 (0.8681) | Close agreement; Yahoo scores higher overall
Communication Services | KUKE, FUBO | KUKE #1 (0.9641), FUBO #4 (0.5936) | KUKE #5 (0.5559), FUBO #4 (0.6544) | Generally consistent rank order
Consumer Cyclical | NIO, UXIN, LOGC | UXIN #1 (0.9612), LOGC #2 (0.8465), NIO #3 (0.8116) | NIO #5 (0.9276); SES #2 (1.2809) not in Stockdex | Partial overlap; Yahoo picks also include SES, MOGU
Consumer Defensive | YQ, OTLY, LOCL | LOCL #1 (0.9359), OTLY #2 (0.8963), YQ #4 (0.6482) | YQ #2 (1.2044), FEDU #1 (1.2189); LOCL missing | Some overlap, differences in top picks
Energy | DWSN, PBF | DWSN #2 (0.6201), PBF #3 (0.4305) | DWSN #1 (0.7613), PBF #3 (0.3556) | Rankings closely aligned
Financial Services | ARBK, MBI, KRNY | ARBK #1 (1.0185), MBI #2 (0.8956), KRNY #3 (0.7703) | ARBK #1 (1.0494), GREE #2 (0.9502); MBI, KRNY missing | Partial overlap
Healthcare | CGC, CCM, TRIB | CGC #1 (1.0091), CCM #2 (0.9822), TRIB #3 (0.9636) | RPTX #1 (1.3350), IFRX #2 (1.2594); CGC etc. missing | Mostly different picks
Industrials | FWRD, EVTL | FWRD #1 (0.9572), EVTL #4 (0.7314) | NPWR #1 (1.2057), EVTL #5 (0.9657) | Some overlap
Real Estate | OPAD, AIV | AIV #1 (0.7303), OPAD #2 (0.7286) | DOUG #1 (0.8116), OPAD #5 (0.5578) | Partial overlap
Technology | RUN, WBX, ZEPP | RUN #1 (1.1565), WBX #2 (0.8604), ZEPP #3 (0.8561) | WBX #2 (1.1401), RUN #3 (1.0593); ZEPP missing | Strong agreement
Utilities | AQN | AQN #1 (0.7382) | OKLO #1 (0.8269); no AQN | Different picks
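
A rough sketch of how a per-sector top-N like the table above can be pulled from each scored frame - it assumes a "sector" column, which is a guess at the column name:

import pandas as pd

def sector_top_n(scores: pd.DataFrame, n: int = 5) -> pd.DataFrame:
    """Top-N symbols per sector with their within-sector rank.

    Assumes 'sector', 'symbol', and 'adj_pillar_score' columns; the sector
    column name is an assumption, not taken from the original reports."""
    ranked = scores.sort_values("adj_pillar_score", ascending=False).copy()
    ranked["sector_rank"] = ranked.groupby("sector").cumcount() + 1
    return ranked[ranked["sector_rank"] <= n][
        ["sector", "sector_rank", "symbol", "adj_pillar_score"]
    ]

# Running this on the Stockdex and Yahoo frames and joining on (sector, symbol)
# gives the overlap rows used in the side-by-side table above.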

So - this is excellent model validation, I think. We see some differences due to the amount of time-period data available, but the results are not wildly different.

I think I can now use this data for an LSTM, perhaps. Or for whatever my next steps turn out to be, because I may - before LSTM - try some earnings transcript parsing for these companies, if that's possible.

