Friday, August 1, 2025

AI / ML - Data Source Comparison with More Data

"To validate generalizability, I ran the same model architecture against two datasets: a limited Yahoo-based dataset and a deeper Stockdex/Macrotrends dataset. Not only did the model trained on Stockdex data achieve a higher R² score (benefiting from more years of data), but the feature importances and pillar scores remained largely stable between the two. This consistency reinforces the robustness of the feature selection process and confirms that the model is learning persistent financial patterns, not artifacts of a specific data source or time window."

Details:

I wanted to move to an LSTM model to predict stock returns, but I was fortunate (and patient) enough to really read up and plan this before just diving into the deep end of the pool.

I learned that the "true AI" models - Transformers, RNNs, and the like (LSTM being a type of RNN) - require much more data. I didn't have anywhere near enough using Yahoo, which gives 4-5 years of history at best. And because I was calculating momentum, YoY growth, and similar features, I would always lose one of those years (rows) right off the bat - a significant percentage of already-scarce data.
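
To make this concrete, here is a tiny pandas sketch (made-up numbers and column names) of why a YoY feature eats one row per ticker:

import pandas as pd

# Hypothetical annual fundamentals: one row per (symbol, fiscal year)
df = pd.DataFrame({
    "symbol":  ["AAPL"] * 4 + ["MSFT"] * 4,
    "year":    [2020, 2021, 2022, 2023] * 2,
    "revenue": [274.5, 365.8, 394.3, 383.3, 143.0, 168.1, 198.3, 211.9],
})

# YoY growth needs the prior year, so the first year of each symbol becomes NaN...
df["revenue_yoy"] = df.groupby("symbol")["revenue"].pct_change()

# ...and dropping those NaNs costs one row per ticker - 25% of a 4-year history
df = df.dropna(subset=["revenue_yoy"])
print(len(df))  # 6 rows left of the original 8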

So, in digging around, I found a Python library called stockdex. It is architected to be able to pull from multiple data sources, but the default is Macrotrends.
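
Pulling the Macrotrends-backed statements through stockdex looks roughly like this. This is a minimal sketch; the attribute names reflect my reading of the stockdex docs and may differ in the version you install:

from stockdex import Ticker

ticker = Ticker(ticker="AAPL")

# Macrotrends-backed annual statements, returned as pandas DataFrames.
# (Attribute names are my understanding of the stockdex interface -
# check the library's docs for the exact names in your version.)
income_stmt   = ticker.macrotrends_income_statement
balance_sheet = ticker.macrotrends_balance_sheet
cash_flow     = ticker.macrotrends_cash_flow

print(income_stmt.shape)  # many more annual periods than Yahoo's 4-5 years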

But using this library and source left several challenges:

  1. No quarterly data in the Python API - even though the Macrotrends website has a "Format" drop-down for Quarterly and Annual.
  2. The data came transposed relative to Yahoo's layout. Yahoo puts the line items as columns (x) and the time periods as rows (y); the stockdex API returned it the opposite way.
  3. The stockdex data had no friendly "names" for the items.

Ultimately, I decided to use it anyway because it returned a LOT more years of data. Here is how I worked through the issues:

  1. First, I put some code together to download the raw data (statements) and then "pivot" it to match Yahoo's format (a sketch of this follows the list).
  2. Then, I used a mapping approach to rename the columns from Macrotrends names to Yahoo names, so that I didn't have to change the logic that parsed the statements.
  3. I did have to do a few run-and-tweak passes on the Metrics and Ratios, and fix certain columns that were not coming in correctly.
  4. Lastly, I ran the model - the same one used for Yahoo - and was able to keep the model logic essentially unchanged.
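
Here is the sketch mentioned in step 1 - a rough illustration of the pivot and column-mapping idea in pandas (the item labels and mapping entries are invented for the example, not the real Macrotrends names):

import pandas as pd

# Raw Macrotrends-style frame: line items as rows, periods as columns
raw = pd.DataFrame(
    {"2021-12-31": [365.8, 94.7], "2022-12-31": [394.3, 99.8]},
    index=["revenue", "net-income"],
)

# Step 1: transpose so it matches Yahoo's layout (periods as rows, items as columns)
stmt = raw.T
stmt.index = pd.to_datetime(stmt.index)

# Step 2: map Macrotrends item names onto the Yahoo names that the existing
# statement-parsing logic already expects (mapping entries are illustrative)
MACROTRENDS_TO_YAHOO = {
    "revenue": "Total Revenue",
    "net-income": "Net Income",
}
stmt = stmt.rename(columns=MACROTRENDS_TO_YAHOO)

print(stmt)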

The training took a LOT longer on the Stockdex data. The combined train+val set had 14,436 rows.
Here is what we got:
FULL R² -- Train: 0.8277, Validation: 0.3441, Test: 0.3537
PRUNED R² -- Train: 0.7714, Validation: 0.3146, Test: 0.3429
Selected FULL model based on test R².
Final Model Test Metrics -- R²: 0.3537, RMSE: 0.3315, MAE: 0.2282
Feature importance summary:
  → Total features evaluated: 79
  → Non-zero importance features: 75
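
For reference, the metrics and the "pick full vs. pruned by test R²" step boil down to something like this (a scikit-learn sketch, not the actual training script):

import numpy as np
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

def test_metrics(y_true, y_pred):
    """The three numbers reported above, for one model's test predictions."""
    return {
        "r2":   r2_score(y_true, y_pred),
        "rmse": np.sqrt(mean_squared_error(y_true, y_pred)),
        "mae":  mean_absolute_error(y_true, y_pred),
    }

def select_model(y_test, full_pred, pruned_pred):
    """Selection rule from the log: keep whichever model has the higher test R²."""
    if r2_score(y_test, full_pred) >= r2_score(y_test, pruned_pred):
        return "FULL", test_metrics(y_test, full_pred)
    return "PRUNED", test_metrics(y_test, pruned_pred)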

Running and scoring the model took a very, very long time. Finally, it came out with this Top 25 list.
Top 25 Stocks by Final adj_pillar_score:
     symbol  adj_pillar_score  improvement_bonus  pillar_score
1209    RUN          1.156474               0.05      1.106474
884     LZM          1.020226               0.05      0.970226
97     ARBK          1.018518               0.00      1.018518
277     CGC          1.009068               0.02      0.989068
262     CCM          0.982228               0.02      0.962228
821    KUKE          0.964131               0.00      0.964131
1415   TRIB          0.963591               0.02      0.943591
1473   UXIN          0.961206               0.05      0.911206
571    FWRD          0.957156               0.05      0.907156
859    LOCL          0.935929               0.00      0.935929
1069   OTLY          0.896289               0.00      0.896289
894     MBI          0.895565               0.05      0.845565
1159   QDEL          0.890248               0.05      0.840248
1039    ODV          0.861127               0.00      0.861127
1522    WBX          0.860391               0.00      0.860391
1578   ZEPP          0.856097               0.02      0.836097
860    LOGC          0.846546               0.05      0.796546
990     NIO          0.811563               0.02      0.791563
1428    TSE          0.775067               0.05      0.725067
930    MODV          0.773322               0.05      0.723322
817    KRNY          0.770282               0.05      0.720282
1545    WNC          0.767113               0.02      0.747113
65     ALUR          0.756362               0.00      0.756362
813    KPTI          0.749644               0.05      0.699644
1316   SRFM          0.743651               0.00      0.743651
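
From the columns in that report, adj_pillar_score is just pillar_score plus improvement_bonus. Producing a Top 25 listing like the one above is roughly this (column names taken from the report; the bonus logic itself isn't shown):

import pandas as pd

def top_25(scores: pd.DataFrame) -> pd.DataFrame:
    """Rank scored symbols and keep the 25 best by final adjusted score."""
    out = scores.copy()
    # The report's columns imply: adj_pillar_score = pillar_score + improvement_bonus
    out["adj_pillar_score"] = out["pillar_score"] + out["improvement_bonus"]
    return out.nlargest(25, "adj_pillar_score")[
        ["symbol", "adj_pillar_score", "improvement_bonus", "pillar_score"]
    ]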

Then I ran the smaller Yahoo model:
Training pruned model...
FULL R² -- Train: 0.8181, Validation: 0.3613, Test: 0.2503
PRUNED R² -- Train: 0.8310, Validation: 0.3765, Test: 0.2693
Selected PRUNED model based on test R².
Final Model Test Metrics -- R²: 0.2693, RMSE: 0.4091, MAE: 0.2606
Feature importance summary:
  → Total features evaluated: 30
  → Non-zero importance features: 30

And, the Top 25 report for that one looks like this:
Top 25 Stocks by Final adj_pillar_score:
     symbol  adj_pillar_score  improvement_bonus  pillar_score
907    MOGU          1.532545               0.05      1.482545
1233    SKE          1.345178               0.05      1.295178
1170   RPTX          1.334966               0.02      1.314966
419      DQ          1.305644               0.05      1.255644
1211    SES          1.280886               0.05      1.230886
702    IFRX          1.259426               0.00      1.259426
908    MOLN          1.244191               0.02      1.224191
1161   RLYB          1.237648               0.00      1.237648
176    BHVN          1.232199               0.05      1.182199
512    FEDU          1.218868               0.05      1.168868
977    NPWR          1.205679               0.00      1.205679
1533     YQ          1.204367               0.02      1.184367
11     ABUS          1.201539               0.00      1.201539
58     ALLK          1.192839               0.02      1.172839
1249    SMR          1.154462               0.00      1.154462
63     ALXO          1.148672               0.02      1.128672
1482    WBX          1.140147               0.00      1.140147
987    NUVB          1.139138               0.00      1.139138
1128     QS          1.130001               0.02      1.110001
864     LZM          1.098632               0.00      1.098632
16     ACHR          1.094872               0.02      1.074872
1176    RUN          1.059293               0.02      1.039293
758    JMIA          1.053711               0.00      1.053711
94     ARBK          1.049382               0.00      1.049382
1086   PHVS          1.039269               0.05      0.989269

Symbols that appear in both top 25 lists:

  • RUN (Stockdex rank 1, Yahoo rank 23)

  • LZM (Stockdex rank 2, Yahoo rank 21)

  • ARBK (Stockdex rank 3, Yahoo rank 24)

  • WBX (Stockdex rank 16, Yahoo rank 17)

interesting...
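
For reference, finding those overlaps is just a merge of the two ranked lists - a quick sketch (not the exact code used), assuming each frame is already sorted by adj_pillar_score:

import pandas as pd

def top25_overlap(stockdex_top: pd.DataFrame, yahoo_top: pd.DataFrame) -> pd.DataFrame:
    """Symbols present in both Top 25 lists, with their rank in each."""
    sdx = stockdex_top.reset_index(drop=True)
    yho = yahoo_top.reset_index(drop=True)
    sdx["stockdex_rank"] = sdx.index + 1   # position in the sorted list = rank
    yho["yahoo_rank"]    = yho.index + 1
    both = sdx.merge(yho, on="symbol", suffixes=("_stockdex", "_yahoo"))
    return both[["symbol", "stockdex_rank", "yahoo_rank"]]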

Comparing the top sector reports:

Side-by-side Overlap Analysis Approach

Sector | Symbol(s) (Overlap) | Stockdex Rank & Score | Yahoo Rank & Score | Notes
Basic Materials | LZM, ODV, MAGN | LZM #1 (1.0202), ODV #2 (0.8611), MAGN #4 (0.7287) | LZM #2 (1.0986), ODV #3 (0.9233), MAGN #5 (0.8681) | Close agreement; Yahoo scores higher overall
Communication Services | KUKE, FUBO | KUKE #1 (0.9641), FUBO #4 (0.5936) | KUKE #5 (0.5559), FUBO #4 (0.6544) | Generally consistent rank order
Consumer Cyclical | NIO, UXIN, LOGC | UXIN #1 (0.9612), LOGC #2 (0.8465), NIO #3 (0.8116) | NIO #5 (0.9276); SES #2 (1.2809) not in Stockdex | Partial overlap; Yahoo picks also include SES, MOGU
Consumer Defensive | YQ, OTLY, LOCL | LOCL #1 (0.9359), OTLY #2 (0.8963), YQ #4 (0.6482) | YQ #2 (1.2044), FEDU #1 (1.2189); LOCL missing | Some overlap, differences in top picks
Energy | DWSN, PBF | DWSN #2 (0.6201), PBF #3 (0.4305) | DWSN #1 (0.7613), PBF #3 (0.3556) | Rankings closely aligned
Financial Services | ARBK, MBI, KRNY | ARBK #1 (1.0185), MBI #2 (0.8956), KRNY #3 (0.7703) | ARBK #1 (1.0494), GREE #2 (0.9502); MBI, KRNY missing | Partial overlap
Healthcare | CGC, CCM, TRIB | CGC #1 (1.0091), CCM #2 (0.9822), TRIB #3 (0.9636) | RPTX #1 (1.3350), IFRX #2 (1.2594); CGC etc. missing | Mostly different picks
Industrials | FWRD, EVTL | FWRD #1 (0.9572), EVTL #4 (0.7314) | NPWR #1 (1.2057), EVTL #5 (0.9657) | Some overlap
Real Estate | OPAD, AIV | AIV #1 (0.7303), OPAD #2 (0.7286) | DOUG #1 (0.8116), OPAD #5 (0.5578) | Partial overlap
Technology | RUN, WBX, ZEPP | RUN #1 (1.1565), WBX #2 (0.8604), ZEPP #3 (0.8561) | WBX #2 (1.1401), RUN #3 (1.0593); ZEPP missing | Strong agreement
Utilities | AQN | AQN #1 (0.7382) | OKLO #1 (0.8269); no AQN | Different picks
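
A rough sketch of how a per-sector top-N like the table above can be pulled from each scored frame - it assumes a "sector" column, which is a guess at the column name:

import pandas as pd

def sector_top_n(scores: pd.DataFrame, n: int = 5) -> pd.DataFrame:
    """Top-N symbols per sector with their within-sector rank.

    Assumes 'sector', 'symbol', and 'adj_pillar_score' columns; the sector
    column name is an assumption, not taken from the original reports."""
    ranked = scores.sort_values("adj_pillar_score", ascending=False).copy()
    ranked["sector_rank"] = ranked.groupby("sector").cumcount() + 1
    return ranked[ranked["sector_rank"] <= n][
        ["sector", "sector_rank", "symbol", "adj_pillar_score"]
    ]

# Running this on the Stockdex and Yahoo frames and joining on (sector, symbol)
# gives the overlap rows used in the side-by-side table above.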

So - this is excellent model validation, I think. We see some differences due to the amount of time-period data available, but the results are not wildly different.

I think I can now use this data for an LSTM, perhaps. Or for whatever my next steps turn out to be, because I may - before LSTM - try some earnings transcript parsing for these companies, if that's possible.

