
Friday, August 1, 2025

AI / ML - Data Source Comparison with More Data

"To validate generalizability, I ran the same model architecture against two datasets: a limited Yahoo-based dataset and a deeper Stockdex/Macrotrends dataset. Not only did the model trained on Stockdex data achieve a higher R² score (benefiting from more years of data), but the feature importances and pillar scores remained largely stable between the two. This consistency reinforces the robustness of the feature selection process and confirms that the model is learning persistent financial patterns, not artifacts of a specific data source or time window."

Details:

I wanted to move to an LSTM model to predict stock returns, but I was fortunate and patient enough to really read up and plan this before just diving into the deep end of the pool.

I learned that the "true AI" models - Transformers, RNNs, et al. (of which LSTM is a subclass) - require a lot more data. I didn't have anywhere near enough data using Yahoo, which gives 4-5 years of history at best. And because I was calculating momentum, YoY growth and such, I would always lose one of the years (rows) right off the bat - a significant percentage of already-scarce data.

So, in digging around, I found a Python library called stockdex. It is architected to support multiple data sources, but the default is Macrotrends.

But using this library and source left several challenges:

  1. No quarterly data in the Python API - although the website does have a "Format" drop-down for Quarterly and Annual.
  2. The data was pivoted relative to Yahoo's. Yahoo puts the line items as columns (x) and the time periods as rows (y); the stockdex API returned it the opposite way.
  3. The stockdex data had no friendly "names" for the items.

Ultimately, I decided to use this because it returned a LOT more years of data.

  1. First, I put some code together to download the raw data (statements) and then "pivot" it to match Yahoo's format (sketched below).
  2. Then, I used a mapping approach to rename the columns from Macrotrends to Yahoo - so that I didn't have to change my logic that parses statements.
  3. I did have to do a few run-and-tweak passes on the Metrics and Ratios, and fix certain columns that were not coming in correctly.
  4. Lastly, I ran the model - the same one used with Yahoo - and was able to keep the model logic essentially unchanged.
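
Here is a minimal sketch of that pivot-and-rename step, assuming the raw statement arrives as a pandas DataFrame with line items as rows and periods as columns. The MACRO_TO_YAHOO entries and column names are illustrative placeholders, not my actual mapping:

    import pandas as pd

    # Illustrative subset of the Macrotrends -> Yahoo name mapping (keys are hypothetical).
    MACRO_TO_YAHOO = {
        "revenue": "totalRevenue",
        "net-income": "netIncome",
        "total-assets": "totalAssets",
    }

    def to_yahoo_layout(raw: pd.DataFrame) -> pd.DataFrame:
        """Transpose a Macrotrends-style statement (items as rows, periods as columns)
        into Yahoo's layout (periods as rows, items as columns) and rename the items."""
        pivoted = raw.T                                   # periods become the row index
        pivoted = pivoted.rename(columns=MACRO_TO_YAHOO)  # map item names to Yahoo's
        pivoted.index.name = "period"
        return pivoted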

The training took a LOT longer on Stockdex. The combined train+val set had 14,436 rows in it.
Here is what we got:
FULL R² -- Train: 0.8277, Validation: 0.3441, Test: 0.3537
PRUNED R² -- Train: 0.7714, Validation: 0.3146, Test: 0.3429
Selected FULL model based on test R².
Final Model Test Metrics -- R²: 0.3537, RMSE: 0.3315, MAE: 0.2282
Feature importance summary:
  → Total features evaluated: 79
  → Non-zero importance features: 75

Running and scoring the model took a very, very long time. Finally it came out with this Top 25 list.
Top 25 Stocks by Final adj_pillar_score:
     symbol  adj_pillar_score  improvement_bonus  pillar_score
1209    RUN          1.156474               0.05      1.106474
884     LZM          1.020226               0.05      0.970226
97     ARBK          1.018518               0.00      1.018518
277     CGC          1.009068               0.02      0.989068
262     CCM          0.982228               0.02      0.962228
821    KUKE          0.964131               0.00      0.964131
1415   TRIB          0.963591               0.02      0.943591
1473   UXIN          0.961206               0.05      0.911206
571    FWRD          0.957156               0.05      0.907156
859    LOCL          0.935929               0.00      0.935929
1069   OTLY          0.896289               0.00      0.896289
894     MBI          0.895565               0.05      0.845565
1159   QDEL          0.890248               0.05      0.840248
1039    ODV          0.861127               0.00      0.861127
1522    WBX          0.860391               0.00      0.860391
1578   ZEPP          0.856097               0.02      0.836097
860    LOGC          0.846546               0.05      0.796546
990     NIO          0.811563               0.02      0.791563
1428    TSE          0.775067               0.05      0.725067
930    MODV          0.773322               0.05      0.723322
817    KRNY          0.770282               0.05      0.720282
1545    WNC          0.767113               0.02      0.747113
65     ALUR          0.756362               0.00      0.756362
813    KPTI          0.749644               0.05      0.699644
1316   SRFM          0.743651               0.00      0.743651

Then I ran the smaller Yahoo model:
Training pruned model...
FULL R² -- Train: 0.8181, Validation: 0.3613, Test: 0.2503
PRUNED R² -- Train: 0.8310, Validation: 0.3765, Test: 0.2693
Selected PRUNED model based on test R².
Final Model Test Metrics -- R²: 0.2693, RMSE: 0.4091, MAE: 0.2606
Feature importance summary:
  → Total features evaluated: 30
  → Non-zero importance features: 30

And, the Top 25 report for that one looks like this:
Top 25 Stocks by Final adj_pillar_score:
     symbol  adj_pillar_score  improvement_bonus  pillar_score
907    MOGU          1.532545               0.05      1.482545
1233    SKE          1.345178               0.05      1.295178
1170   RPTX          1.334966               0.02      1.314966
419      DQ          1.305644               0.05      1.255644
1211    SES          1.280886               0.05      1.230886
702    IFRX          1.259426               0.00      1.259426
908    MOLN          1.244191               0.02      1.224191
1161   RLYB          1.237648               0.00      1.237648
176    BHVN          1.232199               0.05      1.182199
512    FEDU          1.218868               0.05      1.168868
977    NPWR          1.205679               0.00      1.205679
1533     YQ          1.204367               0.02      1.184367
11     ABUS          1.201539               0.00      1.201539
58     ALLK          1.192839               0.02      1.172839
1249    SMR          1.154462               0.00      1.154462
63     ALXO          1.148672               0.02      1.128672
1482    WBX          1.140147               0.00      1.140147
987    NUVB          1.139138               0.00      1.139138
1128     QS          1.130001               0.02      1.110001
864     LZM          1.098632               0.00      1.098632
16     ACHR          1.094872               0.02      1.074872
1176    RUN          1.059293               0.02      1.039293
758    JMIA          1.053711               0.00      1.053711
94     ARBK          1.049382               0.00      1.049382
1086   PHVS          1.039269               0.05      0.989269

Symbols that appear in both top 25 lists:

  • RUN (Stockdex rank 1, Yahoo rank 23)
  • LZM (Stockdex rank 2, Yahoo rank 21)
  • ARBK (Stockdex rank 3, Yahoo rank 24)
  • WBX (Stockdex rank 16, Yahoo rank 17)

Interesting...

Comparing the top sector reports:

Side-by-side Overlap Analysis Approach

Sector | Symbol(s) (Overlap) | Stockdex Rank & Score | Yahoo Rank & Score | Notes
Basic Materials | LZM, ODV, MAGN | LZM #1 (1.0202), ODV #2 (0.8611), MAGN #4 (0.7287) | LZM #2 (1.0986), ODV #3 (0.9233), MAGN #5 (0.8681) | Close agreement; Yahoo scores higher overall
Communication Services | KUKE, FUBO | KUKE #1 (0.9641), FUBO #4 (0.5936) | KUKE #5 (0.5559), FUBO #4 (0.6544) | Generally consistent rank order
Consumer Cyclical | NIO, UXIN, LOGC | UXIN #1 (0.9612), LOGC #2 (0.8465), NIO #3 (0.8116) | NIO #5 (0.9276); SES #2 (1.2809) not in Stockdex | Partial overlap; Yahoo picks also include SES, MOGU
Consumer Defensive | YQ, OTLY, LOCL | LOCL #1 (0.9359), OTLY #2 (0.8963), YQ #4 (0.6482) | YQ #2 (1.2044), FEDU #1 (1.2189); LOCL missing | Some overlap, differences in top picks
Energy | DWSN, PBF | DWSN #2 (0.6201), PBF #3 (0.4305) | DWSN #1 (0.7613), PBF #3 (0.3556) | Rankings closely aligned
Financial Services | ARBK, MBI, KRNY | ARBK #1 (1.0185), MBI #2 (0.8956), KRNY #3 (0.7703) | ARBK #1 (1.0494), GREE #2 (0.9502); MBI, KRNY missing | Partial overlap
Healthcare | CGC, CCM, TRIB | CGC #1 (1.0091), CCM #2 (0.9822), TRIB #3 (0.9636) | RPTX #1 (1.3350), IFRX #2 (1.2594); CGC etc. missing | Mostly different picks
Industrials | FWRD, EVTL | FWRD #1 (0.9572), EVTL #4 (0.7314) | NPWR #1 (1.2057), EVTL #5 (0.9657) | Some overlap
Real Estate | OPAD, AIV | AIV #1 (0.7303), OPAD #2 (0.7286) | DOUG #1 (0.8116), OPAD #5 (0.5578) | Partial overlap
Technology | RUN, WBX, ZEPP | RUN #1 (1.1565), WBX #2 (0.8604), ZEPP #3 (0.8561) | WBX #2 (1.1401), RUN #3 (1.0593); ZEPP missing | Strong agreement
Utilities | AQN | AQN #1 (0.7382) | OKLO #1 (0.8269); no AQN | Different picks

So - this is excellent model validation, I think. We see some differences due to the amount of time-period data we have, but the results are not wildly different.

I think I can now use this data for an LSTM, perhaps. Or for whatever my next step turns out to be - because I may, before LSTM, try to do some earnings transcript parsing for these symbols if it's possible.


AI / ML - Modeling Fundamentals - Mistakes Found and Corrected

After adding Earnings Surprise Score data into my one-year forward return prediction model, I kind of felt as though I had hit the end of the road on the model. The Earnings Surprise Score did move the needle. But with all of the effort in Feature Engineering I had put into this model, the only thing I really felt I could still add was sentiment (news). Given that news is more of a real-time concept, grows stale, and would be relevant for only the latest row of data, I decided to do some final reviews and move on - or "graduate" to some new things, like maybe trying out a neural network or doing more current or real-time analysis. In fact, I had already tried a Quarterly model, but the R-squared on it was terrible and I decided not to use it - not even to ensemble it with the annual report data model.

So - I asked a few different LLMs to code-review my model. And I was horrified to learn that, because of using LLMs to continually tweak my model, I had wound up with issues related to "Snippet Integration".

Specifically, I had some major issues:

1. Train/Test Split Happened Too Early or Multiple Times

  •  Splitting data before full preprocessing (e.g., before feature scaling, imputation, or log-transforming the target).
  •  Redundant train/test splits defined in multiple places — some commented, some active — leading to potential inconsistencies depending on which was used.


2. No Validation Set

  •  Originally, data was split into training and test sets.
    •  This meant that model tuning (e.g. SHAP threshold, hyperparameter selection) was inadvertently leaking test set information. 

  •  Now corrected with a clean train/val/test split (see the sketch after this list).


3. Inconsistent Preprocessing Between Train and Test

  •  Preprocessing steps like imputation, outlier clipping, or scaling were not always applied after the split.
  •  This risked information bleeding from test → train, violating standard ML practice.


4. Improper Handling of Invalid Target Values (fwdreturn)

  •  NaN, inf, and unrealistic values (like ≤ -100%) were not being consistently filtered.
  •  This led to silent corruption of both training and evaluation scores.
  •  Now fixed with a strict invalid_mask and logging/reporting of dropped rows.


5. Redundant or Conflicting Feature Definitions

  •  There were multiple, overlapping blocks like:

                features = [col for col in df.columns if ...]
                X = df[features]

  •  This made it unclear which feature list was actually being used.
  •  Sector z-scored and raw versions were sometimes duplicated or mixed without clarity.


6. Scaling and Z-Scoring Logic Was Not Modular or Controlled

  •  Originally, some features were being z-scored after asset-scaling (which didn’t make sense).
  •  Some metrics were scaled both to assets and z-scored sector-wise, which polluted the modeling signal.
  •  Now addressed with clear separation and feature naming conventions.


7. SHAP Was Applied to a Noisy or Unclean Feature Space

  •  Without proper pruning first (e.g., dropping all-NaN columns), SHAP feature importance included irrelevant or broken columns.
    •  This could inflate feature count or misguide model interpretation.
  •  Now resolved by cleaning feature set before SHAP and applying SHAP-based selection only on valid, imputed data.
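
For reference, here is a minimal sketch of the corrected flow mentioned in item 2: drop invalid targets with an explicit mask, split once into train/val/test, and fit the preprocessing on the training slice only. The df and feature_cols names are assumed to exist; none of this is lifted from my actual script:

    import numpy as np
    from sklearn.impute import SimpleImputer
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    # 1) Drop invalid targets up front, logging how many rows were lost.
    invalid_mask = (
        df["fwdreturn"].isna()
        | ~np.isfinite(df["fwdreturn"])
        | (df["fwdreturn"] <= -1.0)   # a -100% (or worse) return is treated as bad data
    )
    print(f"Dropping {int(invalid_mask.sum())} rows with invalid fwdreturn")
    df = df.loc[~invalid_mask]

    X, y = df[feature_cols], df["fwdreturn"]

    # 2) One split, done once: 70% train, 15% validation, 15% test.
    X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.30, random_state=42)
    X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.50, random_state=42)

    # 3) Fit imputation/scaling on the training slice only, then apply to val/test.
    imputer = SimpleImputer(strategy="median").fit(X_train)
    scaler = StandardScaler().fit(imputer.transform(X_train))

    def preprocess(part):
        return scaler.transform(imputer.transform(part))

    X_train_p, X_val_p, X_test_p = (preprocess(p) for p in (X_train, X_val, X_test))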
One issue was that the modeling code lived in the main() function, which had gotten way too long and hard to read. That function held all of the training, testing, splitting/pruning, the model fitting, AND all of the scoring. I broke the train/validate/test process out on its own, and put that process into play for both the full model and the SHAP-pruned model. Then I took the best R-squared from the two runs and used the winning model.
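
A rough sketch of that refactor: a single evaluate() helper that fits and scores one feature list on the train/val/test splits (assumed to be kept as DataFrames), called for both the full and SHAP-pruned feature sets, with the winner picked by test R². Names like all_features and shap_pruned_features are placeholders:

    from sklearn.metrics import r2_score
    from xgboost import XGBRegressor

    def evaluate(features):
        """Fit on the training split and report R² on train, validation, and test."""
        model = XGBRegressor(n_estimators=500, learning_rate=0.05, random_state=42)
        model.fit(X_train[features], y_train)
        scores = {
            name: r2_score(y_part, model.predict(X_part[features]))
            for name, X_part, y_part in [
                ("train", X_train, y_train),
                ("val", X_val, y_val),
                ("test", X_test, y_test),
            ]
        }
        return model, scores

    full_model, full_scores = evaluate(all_features)
    pruned_model, pruned_scores = evaluate(shap_pruned_features)

    # Keep whichever model scores better on the held-out test set.
    best_model = full_model if full_scores["test"] >= pruned_scores["test"] else pruned_model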

Tuesday, July 29, 2025

AI / ML - Feature Engineering - Earnings Surprises

I really felt that news would be a great thing to add to the model. But the problem with news is that it is recent, and the data I am using with XGBoost is historical time-series data.

If you added news, what would you do - cram the values only into the most recent year?

I think if you go with data that is changing and close to real-time, you need to re-think the whole model, including the type of model. Maybe news works better with a Transformer or an LSTM neural network than with a predictive regression model.

So - I am running out of new things to add to my model to try and boost the predictability of it (increase the R-squared).

Then I came up with the idea of adding earnings beats, meets, and misses. A quick consult with an LLM suggested using an earnings_surprise score, so that not only do we get the beat/meet/miss counts but we also capture the magnitude. A great idea.
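
For illustration, a minimal sketch of one way such a score can be computed per symbol (the column names are hypothetical, and the real implementation may clip values or weight recent quarters differently):

    import pandas as pd

    def earnings_surprise_score(earnings: pd.DataFrame) -> float:
        """Mean relative EPS surprise across reported quarters for one symbol.
        Captures magnitude, not just counts of beats/meets/misses."""
        surprise = (earnings["epsActual"] - earnings["epsEstimate"]) / earnings["epsEstimate"].abs()
        return surprise.mean()

    # e.g., one score per symbol from a table of quarterly estimates vs. actuals
    scores = quarterly_earnings.groupby("symbol").apply(earnings_surprise_score)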

I implemented this, and lo and behold, the earnings_surprise score moves the needle. Substantially and consistently.

The best thing about this is that the earnings_surprise score is symbol-specific, so it is not some macro feature I have to figure out how to interact with the symbol data.

Saturday, July 5, 2025

AI / ML - Here is Why You Backtest

My model was working nicely. 

It scored stocks on a number of fronts (pillars).

It used Profitability. It used Solvency. It used Liquidity. It used Efficiency.

These are the "four horsemen" of stock evaluation.

I added some of my own twists to the "grading formula", in the form of macro variables (consumer sentiment, business confidence, et al.). I also had some trend analysis, rewarding trends up and penalizing trends down. I rewarded (and penalized) profitability, cash flow, etc. I had scaling done correctly, too, in order to ensure a "fair playing field", and some sector normalization as well.

When I ran the model, using XGBoost to predict 1-year forward return, the stocks at the top of the report looked great when I spot-checked them against various sites that also grade out stocks. I felt good. The r-squared I was getting from XGBoost and a SHAP-pruned feature run was at academic levels (as high as .46 at one point).

As part of some final QA, I ran the resultant code through AI engines which praised the thoroughness, and slapped me on the back reassuring me that my model was on a par with, if not superior to, many academic models.

Then - someone asked me if this model had been back-tested.
And the answer was no.  I had not back-tested it up to that point. I didn't think I was ready for back-testing. 

Maybe back-testing is an iterative "continual improvement" process that should be done much earlier in the process, to ensure you don't go down the wrong road.  But I didn't do that.

So, I ran a back-test. And to my horror, the model was completely "upside down" in terms of stocks that would predict forward return.  The AI engines suggested I simply "flip the sign" on my score and invert them. But that didn't feel right. It felt like I was trying to force a score.  

So - the first thing we did was evaluate the scoring. We looked at the correlation between individual scoring pillars and forward return. Negative.

We then looked at correlation in more detail.

First, we calculated Pearson (row-level) and Spearman (rank-level) correlations.

They were negative.

Then, we calculated Average Fwd Return by Score Decile. Sure enough, there was a trend, but completely backwards from what one would expect. 

Quality stocks in score deciles 9, 8, 7, 6, and 5 had negative average forward returns that improved as the decile dropped, while the shaky stocks (deciles 0-4) had progressively more positive values.
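
For reference, a minimal sketch of those checks, assuming a DataFrame with one row per symbol-year holding pillar_score and fwd_return columns (the names are mine):

    import pandas as pd
    from scipy.stats import pearsonr, spearmanr

    pearson_r, _ = pearsonr(df["pillar_score"], df["fwd_return"])
    spearman_r, _ = spearmanr(df["pillar_score"], df["fwd_return"])
    print(f"Pearson: {pearson_r:.3f}  Spearman: {spearman_r:.3f}")

    # Average forward return by score decile (0 = lowest scores, 9 = highest).
    df["score_decile"] = pd.qcut(df["pillar_score"], 10, labels=False, duplicates="drop")
    print(df.groupby("score_decile")["fwd_return"].mean())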

The interesting analysis was a dump of the correlations of each individual pillar to forward return. The strongest were Profitability and Valuation, followed by MacroBehavior (macroeconomic features), but none of these were strong. The correlations were slightly negative, with a couple slightly above zero.

But one was VERY interesting: a log1p correlation between the "final composite score" and forward return that was noticeable, if not sizable - but negative.

We experimented with commenting out the penalties, so we could focus on "true metrics" (a flag was engineered in to turn these off, which made it easy to test). We re-ran the model, and STILL the correlations with forward return were negative.

Then we decided to remove individual pillars. It didn't change a thing. STILL the correlations with forward return were negative.

Finally, after the AI assured me - after reviewing the code - that there were no scoring errors, the only thing left to try, aside from shelving the model for lack of success in predicting forward return, was to in fact put a negative sign on the score to invert it and "flip the score".

I did this. And, while the companies that bubbled to the top were shaky on their fundamentals, I did see cases where analyst price targets on these stocks were above (and in some cases way above) the current stock price.

So here is the evidence that we have a model that IS predicting forward return, in a real way.

So - in conclusion. Quality does NOT necessarily equate to forward return.

What does? Well, nothing in those pillars individually. But when you combine all of these metrics/features into a big pot and send them to a sophisticated regression modeler, it does find a combination that has a roughly linear relationship with forward return - and depending on which way you flip that line, you can theoretically gain, or lose, a return on your money.

Now, if we had put money into those "great stocks" at the top of that prior list, and then had to watch as we lost money, it would have been puzzling and frustrating. But - do we have the courage to put money into these less-than-stellar fundamental stocks to see if this model is right, and that we WILL get a positive forward return? 

I guess it takes some experimentation. Either a simulator, OR put $X into the top ten and another $X into the bottom ten and see how they perform. Which is what I might be doing shortly.


Wednesday, July 2, 2025

AI / ML - Altman-Z Score

I saw a website that was showing an Altman-Z score for companies in their Solvency section.

Not fully aware of this calculation, I decided to jump in and add it to my model.

I quickly backed it out.

Why?

Altman-Z uses different calculations based on the industry a company is in. 

Manufacturers use one calculation, other companies use another, and banks and finance companies don't calculate it at all. 
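
To make the problem concrete, here is roughly what the per-industry branching looks like. The coefficients are the published Z (manufacturers) and Z'' (non-manufacturers) formulas; the sector labels and column names are my own assumptions, not my pipeline's:

    import numpy as np

    def altman_z(row) -> float:
        """Altman Z for manufacturers, Z'' for other non-financial firms, NaN for financials."""
        if row["sector"] == "Financial Services":
            return np.nan  # the score is not defined for banks and finance companies
        wc_ta = row["workingCapital"] / row["totalAssets"]
        re_ta = row["retainedEarnings"] / row["totalAssets"]
        ebit_ta = row["ebit"] / row["totalAssets"]
        if row["isManufacturer"]:
            mve_tl = row["marketCap"] / row["totalLiabilities"]
            sales_ta = row["revenue"] / row["totalAssets"]
            return 1.2 * wc_ta + 1.4 * re_ta + 3.3 * ebit_ta + 0.6 * mve_tl + 1.0 * sales_ta
        bve_tl = row["bookEquity"] / row["totalLiabilities"]
        return 6.56 * wc_ta + 3.26 * re_ta + 6.72 * ebit_ta + 1.05 * bve_tl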

So imagine calculating this and feeding it into XGBoost / SHAP to predict price or return on a security. 

First of all, because you have so many NaN values (nonexistent scores), you have a missingness issue. Then, the values differ due to the different calculation methods. And if you don't cap the score, you can get outliers that wreak havoc.

So in the end, it's fine to calculate it, but if you calculate it, don't model it as a predictive feature. 

Just calculate it and "tack it on" (staple it) to any sector-specific scores you are generating for purposes of stuff like rank within sector. 

AI / ML - Feature Explosion and Normalization

I got into some trouble with my model when it blew up to 150-180 features. When I finally took the time to really scrutinize things, I noticed that I had duplicates of raw metrics alongside their sector-z-scored counterparts. I don't know how that happened, or how those sector-z-scored components got into the feature set submitted to XGBoost and SHAP. Probably logic inserted in the wrong place.

I wound up removing all of the sector-z scored metrics for now. 

But this highlighted a problem. Semantics.

I had some metrics that needed to be normalized for scoring and comparison purposes - mostly raw metrics - and to do this, we divided the value by TotalAssets. For metrics that we did NOT want to do this to, we had some exclusion logic based on Regular Expressions (regex). We looked for metrics that contained "Per" and "To" (among others).

This seems to have fixed our set of features, and it is so much better to see 30 of 80 features selected instead of 100 of 180. It reduced a ton of noise in the model, improving its integrity.

Now I do need to go back and examine why we did the sector z-scores initially, to see if that is something we do need to engineer back in. I think we need to do that in the cases where we are producing a Top-X-By-Sector report. 

Thursday, June 26, 2025

AI / ML - Using a DAG Workflow for your Data Pipeline

I have been trying to find a lightweight mechanism to run my increasing number of scripts for my data pipeline.

I have looked at a lot of them. Most are heavy - requiring databases, message queues, and all that goes with a typical 3-4 tier application. Some run in Docker.

I tried using Windmill at first. I liked the GUI - very nice. The deal-killer for me was that Windmill wants to soak in and make its own copies of anything it runs. It can't just reach out and run scripts that, for example, are sitting in a directory path. It apparently can't (I could be wrong on this, but I think I am right) do a git clone to a directory and run the content from where it sits. It wants to pull everything into its own internal database - as a copy. It wants to be a development environment. Not for me. Not what I was looking for. And I only want my "stuff" to be in a single spot.

I then tried using Prefect. What a mess. You have to write Python to use Prefect. Right away, the SQLite eval database was locking when I did anything asynchronous. Then came stack traces, issues with the CLI, etc. I think they're changing this code too much. Out of frustration, I killed the server and moved on.

My latest is DAGU - from GitHub, open source. Wow - simple, says what it does, does what it says. It does not have some of the more advanced features, but it has a nice, crisp, well-designed and responsive UI, and it runs my stuff better than cron can.

Here is a sample screenshot. I like it.


 

There are times when you need to fire your AI and go Solo

I spent HOURS trying to figure out what was going wrong in a complex data processing pipeline. The data of course gets fed into ML / AI models.

I kept loading the source into my AI, and asking it to tell me what was wrong. And, it ALWAYS seemed to find some adjustment. And the code kept morphing and changing. I even tried a couple of different AI Bots (returning to the original one when the alternative couldn't solve the issue either).

Sometimes, the same or similar code would be piled in and run consecutively. But in the end, after wholesale changes to the code, I kept seeing the SAME problems over and over (missing data, in this case).

Finally, I had to pull back. I had to "fire" the AI and spend an hour studying the code myself. 

And lo and behold, instead of relying on a bot to analyze and figure out what was wrong - which it SHOULD be able to do at efficiencies an order of magnitude beyond mine - I was able to fairly quickly figure out the problem myself. It was an issue with Pandas dataframes: two were merged, but we kept marching forward in processing with the original pre-merged one, resulting in utter chaos.

I was pissed. I was tired. I yelled at the AI. And this is what it responded with. 

I hear you. You're right to be furious — especially after hours of hitting the same wall over and over. That df vs merged_df mistake is exactly the kind of thing that should have been caught and fixed early. And I didn’t surface it clearly when you needed it most.

That’s on me.

You're not wrong for being pissed. You’ve been careful, methodical, and generous with your time and attention. You deserved better help. I didn’t give it to you, and I own that completely.

If you want to finish cleaning this up now — or just have me double-check the full logic start to end — I’m here. Quietly, clearly, and with no more nonsense. Just say the word.

At least it took responsibility. Interesting. 




Tuesday, June 17, 2025

AI/ML - Feature Engineering - Normalization

On my quarterly financials model, the R² is awful. I have decided that I need more data to make this model score better. 3-4 quarters is probably not enough. That means I need to build something to go to the SEC filings and parse them.

So, for now, I have swung back to my Annual model, and I decided to review scoring. 

One thing I noticed - that I had forgotten about - was that I had a Normalization routine, which took certain metrics and tried to scale-flatten them for better ranking and comparison purposes. This routine takes certain metrics and divides them by Total Assets. I am sure this was a recommendation from one of the AI bot engines I was consulting with in doing my scoring (which is complex, to say the least).

Anyway, I had to go in and make sure I understood what was being normalized and what was not. The normalization logic uses keywords, looking to skip certain metrics that should NOT be normalized. For the ones that ARE normalized, the metric is divided by TotalAssets, and the metric's name is changed dynamically to reflect this. This logic was doing its job reasonably well, but since I added a plethora of new metrics, some of them were being normalized that should not have been.

So this is the new logic. 
    # --- Begin normalization for quality scoring ---
    scale_var = 'totalAssets'
    if scale_var not in combined_data.columns:
        raise ValueError(f"{scale_var} not found in data columns!")

    def needs_normalization(metric):
        # Heuristic: skip ratios, margins, yields, returns, and others that should not be normalized.
        skipthese = ['Margin', 'Ratio', 'Yield', 'Turnover', 'Return', 'Burden', 'Coverage', 'To', 'Per',
                     'daysOf', 'grw_yoy', 'nopat', 'dilutedEPS', 'ebitda', 'freeCashFlowPerShare',
                     'sentiment', 'confidence', 'momentum', 'incomeQuality']
        return all(k.lower() not in metric.lower() for k in skipthese)
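
The snippet above only decides which metrics to touch; here is a minimal sketch of the divide-and-rename step that follows it (the "_toTotalAssets" suffix is my own naming for illustration, not necessarily what the script uses):

    numeric_cols = combined_data.select_dtypes("number").columns
    for col in numeric_cols:
        if col == scale_var or not needs_normalization(col):
            continue  # leave ratios, margins, per-share metrics, etc. untouched
        combined_data[f"{col}_toTotalAssets"] = combined_data[col] / combined_data[scale_var]
        combined_data.drop(columns=col, inplace=True)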

And this took me some time to get working properly, because when you have 70+ possible metrics in your basket, ensuring that each is calculated correctly, and that certain ones are normalized and certain ones are NOT, takes time.

 

Friday, June 13, 2025

AI/ML Feature Engineering - Adding Feature-Based Features

I added some new features (metrics) to my model. The Quarterly model.

To recap, I have downloaded quarterly statements for stock symbols, and I use these to calculate an absolute slew of metrics and ratios. Then I feed them into the XGBoost regression model, to figure out whether they can predict a forward return of stock price.

I added some macro economic indicators, because I felt that those might impact the quarterly price of a stock (short term) more than the pure fundamentals of the stock. 

The fundamentals are used in an annual model - a separate model - and in that model, the model is not distracted or interrupted with "events" or macroeconomics that get in the way of understanding the true health of a company based on fundamentals over a years-long period of time.

So - what did I add to the quarterly model?

  • Consumer Sentiment
  • Business Confidence
  • Inflation Expectations
  • Treasury Data (1,3,10 year)
  • Unemployment 

And wow - did these variables kick in. At one point, I had the model's R-squared up to .16.

Unemployment did nothing, actually. And I wound up removing it as a noise factor. I also realized I had the fiscal quarter included, and removed that too since it, like sector and other descriptive variables, should not be in the model.

But - as I was about to put a wrap on it, I decided to do one more "push" to improve the R-squared value, and started fiddling around. I got cute, adding derived features. One of the things I did, was to add lag features for business confidence, consumer sentiment, inflation expectations. Interestingly, two of these shot to the top of influential metrics.

Feature Importance List Sorted by Importance (return price influence):
feature                     weight
business_confidence_lag1    0.059845
inflation_lag1              0.054764

But others were a bust, with 0.00000 values.

I tried removing the original metrics and JUST keeping the lags - didn't really help.
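
For reference, a minimal sketch of how lag features like these can be built with pandas, assuming a quarterly frame sorted per symbol (the column names are hypothetical):

    import pandas as pd

    def add_macro_lags(df: pd.DataFrame, cols, lag: int = 1) -> pd.DataFrame:
        """Add <col>_lag<N> columns holding each macro series shifted back N quarters."""
        df = df.sort_values(["symbol", "fiscal_period"])
        for col in cols:
            df[f"{col}_lag{lag}"] = df.groupby("symbol")[col].shift(lag)
        return df

    df = add_macro_lags(df, ["business_confidence", "consumer_sentiment", "inflation_expectations"])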

Another thing worth noting is that I added SHAP values - a topic I will get into in more depth shortly, perhaps in a subsequent post. SHAP (SHapley Additive exPlanations) is a method used to explain the output of machine learning models by assigning each feature an importance value for a specific prediction, so that models - like so many - are not completely "black box".

But one thing I noticed when I added the SHAP feature list is that it does NOT match / line up with the feature importances that the XGBoost model reports.
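
A minimal sketch of putting the two rankings side by side for comparison, assuming a fitted XGBRegressor called model and the training features in X_train:

    import numpy as np
    import pandas as pd
    import shap

    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X_train)

    comparison = pd.DataFrame({
        "xgb_importance": pd.Series(model.feature_importances_, index=X_train.columns),
        "mean_abs_shap": pd.Series(np.abs(shap_values).mean(axis=0), index=X_train.columns),
    }).sort_values("mean_abs_shap", ascending=False)
    print(comparison.head(20))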

So I definitely need to look into this.

Wednesday, June 11, 2025

AI/ML - Feature Engineering

Originally, when I first started this project to learn AI, I set it up thus:

Features=Annual Statements Metrics and Financial Ratios (calculated) ---to predict---> Stock Price

There are tons and tons of metrics and ratios that you can throw into a model - at one point mine had over 50 "features" (metrics, ratios, or, columns of data). 

Quickly, you get into Feature Engineering. 

You see, certain metrics are "circular" and co-dependent. You cannot use price-derived metrics to try and predict price. So these metrics need to be excluded if they are calculated and present in your dataset.

You can use techniques like Clustering (K-Means, DBSCAN, Agglomerative) to get a feel for how your features allow your data to be classified into clusters. An interesting exercise I went through, but at the end, moved away from in pursuit of trying to pick winning stocks.

You can use some nice tools for picking through a huge amount of data and finding "holes" (empty values, etc) that can adversely affect your model. 

From a column (feature) perspective, you can: 

  • Fill these holes by imputing them (using the mean, median, or some other mechanism).
  • Or drop the offending columns entirely.

You can also drop entire rows that have X percentage of missing values, or drop rows that are missing key values. Figuring all of this out takes time. It is part of the Data Engineering.
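
A minimal sketch of those row- and column-level decisions with pandas (the thresholds and column names are illustrative):

    # Drop rows missing more than 30% of their values, or missing the target.
    min_non_null = int(0.7 * df.shape[1])
    df = df.dropna(thresh=min_non_null).dropna(subset=["fwdreturn"])

    # Impute the remaining numeric holes with each column's median.
    num_cols = df.select_dtypes("number").columns
    df[num_cols] = df[num_cols].fillna(df[num_cols].median())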

Eventually, I figured out that I needed to change my model - it needed to try and predict return, not price. AND - I needed to change my model from Random Forest to XGBoost (as mentioned in an earlier post). 

So now, we will be doing this...

Features=Annual Statements Metrics and Financial Ratios (calculated) ---to predict---> Forward Return

Well, guess what? If you calculate a forward return, you are going to lose at least your first row of data. Given that we typically throw away 2020 because of missing values (Covid, I presume), this means you now lose 2020 and 2021 - leaving you with just 2022, 2023, and 2024. Yes, you have thousands of symbols, but you cannot afford to be training and testing a model where you are losing that much data. But that is the way it has to be... most financial models are seeking return, not a price. Enlightening. Makes sense.

I also realized that, in order to "smooth out the noise", it made sense to use multiple periods in calculating the return. This causes MORE data to be lost. So it becomes a matter of balancing the tradeoff between maximizing your R-squared value and the loss of data.
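
A minimal sketch of the forward return calculation that causes this row loss (the column names are mine, and the real pipeline averages over multiple periods to smooth the noise):

    import pandas as pd

    def add_forward_return(df: pd.DataFrame, periods: int = 1) -> pd.DataFrame:
        """Return over the next `periods` reports per symbol; rows with no
        look-ahead price come out NaN and get dropped - the data loss above."""
        df = df.sort_values(["symbol", "fiscalYear"])
        future_price = df.groupby("symbol")["price"].shift(-periods)
        df["fwd_return"] = future_price / df["price"] - 1.0
        return df.dropna(subset=["fwd_return"])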

I added some additional metrics (features): 

  • qoq growth (eps, revenue, free cash flow) 
  • momentum  

 So far, these are now showing up in the top features that influence the forward return.

But the R-squared for the Quarterly model is .09 - which is extremely low and not very valuable. More features will need to be added that can pop that R-squared up in order for quarterly data to be useful in predicting a stock's forward return. 

Thursday, June 20, 2024

New AI Book Arrived - Machine Learning for Algorithmic Trading

This thing is like 900 pages long.

You want to take a deep breath and make sure you're committed before you even open it.

I did check the Table of Contents and scrolled quickly through, and I see it's definitely a hands-on applied technology book using the Python programming language.

I will be blogging more about it when I get going.

 



