
Friday, August 1, 2025

AI / ML - Modeling Fundamentals - Mistakes Found and Corrected

After adding Earnings Surprise Score data to my 1-year-forward return prediction model, I felt as though I had hit the end of the road with it. The Earnings Surprise Score did move the needle. But after all of the Feature Engineering effort I had put into this model, the only thing I really felt I could still add was sentiment (news). Given that news is more of a real-time concept, grows stale quickly, and would be relevant only for the latest row of data, I decided to do some final reviews and move on, or "graduate", to some new things - like maybe trying out a neural network or doing more current or real-time analysis. In fact, I had already tried a Quarterly model, but the R-squared on it was terrible and I decided not to use it - not even to ensemble it with the annual-report model.

So - I asked a few different LLMs to code review my model. And I was horrified to learn that, from using LLMs to continually tweak the model, I had wound up with issues related to "Snippet Integration".

Specifically, I had some major issues:

1. Train/Test Split Happened Too Early or Multiple Times

  •  Splitting data before full preprocessing (e.g., before feature scaling, imputation, or log-transforming the target).
  •  Redundant train/test splits defined in multiple places — some commented, some active — leading to potential inconsistencies depending on which was used.


2. No Validation Set

  •  Originally, the data was split only into training and test sets.
    •  This meant that model tuning (e.g. SHAP threshold, hyperparameter selection) was inadvertently leaking test set information. 

  •  Now corrected with a clean train/val/test split.
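
A minimal sketch of the corrected split, assuming a preprocessed DataFrame df with a fwdreturn target column (all names illustrative): preprocessing finishes first, the split is defined in exactly one place, and a validation set is carved out so tuning never touches the test set.

    from sklearn.model_selection import train_test_split

    # df is the already-preprocessed frame; "fwdreturn" is the target column.
    X = df.drop(columns=["fwdreturn"])
    y = df["fwdreturn"]

    # One split, defined in exactly one place: 70% train, 15% validation, 15% test.
    X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.30, random_state=42)
    X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.50, random_state=42)

    # Tuning (SHAP threshold, hyperparameters) is done against X_val / y_val only;
    # X_test / y_test is scored once, at the very end.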


3. Inconsistent Preprocessing Between Train and Test

  •  Preprocessing steps like imputation, outlier clipping, or scaling were not always applied after the split.
  •  This risked information bleeding from test → train, violating standard ML practice.
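
The fit/transform pattern that prevents this, sketched with scikit-learn (the exact steps in my pipeline differ; this just shows fitting on the training rows only):

    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler

    imputer = SimpleImputer(strategy="median")
    scaler = StandardScaler()

    # Fit on the training split only...
    X_train_prep = scaler.fit_transform(imputer.fit_transform(X_train))

    # ...then apply the same fitted transforms to validation and test.
    X_val_prep = scaler.transform(imputer.transform(X_val))
    X_test_prep = scaler.transform(imputer.transform(X_test))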


4. Improper Handling of Invalid Target Values (fwdreturn)

  •  NaN, inf, and unrealistic values (like ≤ -100%) were not being consistently filtered.
  •  This led to silent corruption of both training and evaluation scores.
  •  Now fixed with a strict invalid_mask and logging/reporting of dropped rows.
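
Roughly what the fix looks like (thresholds illustrative): build an explicit mask of invalid target values, report how many rows it removes, and drop them before any split happens.

    import numpy as np

    # Flag NaN, +/-inf, and implausible returns (a -100% forward return or worse).
    invalid_mask = (
        df["fwdreturn"].isna()
        | np.isinf(df["fwdreturn"])
        | (df["fwdreturn"] <= -1.0)
    )

    print(f"Dropping {invalid_mask.sum()} of {len(df)} rows with invalid fwdreturn")
    df = df.loc[~invalid_mask].copy()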


5. Redundant or Conflicting Feature Definitions

  •  There were multiple, overlapping blocks like:

                features = [col for col in df.columns if ...]
                X = df[features]

  •  This made it unclear which feature list was actually being used.
  •  Sector z-scored and raw versions were sometimes duplicated or mixed without clarity.


6. Scaling and Z-Scoring Logic Was Not Modular or Controlled

  •  Originally, some features were being z-scored after asset-scaling (which didn’t make sense).
  •  Some metrics were scaled both to assets and z-scored sector-wise, which polluted the modeling signal.
  •  Now addressed with clear separation and feature naming conventions.
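
The separation now looks something like this sketch - asset-scaling and sector z-scoring live in separate helpers, and the column suffix records which transform produced the value (column and function names here are illustrative, not the actual code):

    def scale_to_assets(df, cols):
        # Divide raw dollar metrics by TotalAssets; the suffix records the transform.
        for col in cols:
            df[f"{col}_to_assets"] = df[col] / df["TotalAssets"]
        return df

    def sector_zscore(df, cols):
        # Z-score a metric within its sector; used for ranking/reports, not the model.
        for col in cols:
            grp = df.groupby("Sector")[col]
            df[f"{col}_sector_z"] = (df[col] - grp.transform("mean")) / grp.transform("std")
        return df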


7. SHAP Was Applied to a Noisy or Unclean Feature Space

  •  Without proper pruning first (e.g., dropping all-NaN columns), SHAP feature importance included irrelevant or broken columns.
    •  This could inflate feature count or misguide model interpretation.
  •  Now resolved by cleaning feature set before SHAP and applying SHAP-based selection only on valid, imputed data.
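
A rough sketch of the corrected ordering, under the assumption that imputation has already happened upstream: prune broken columns, fit the model, and only then let SHAP rank and select features (threshold and hyperparameters illustrative):

    import shap
    import xgboost as xgb

    # Prune obviously broken columns before anything touches SHAP.
    X_train = X_train.dropna(axis=1, how="all")
    X_val = X_val[X_train.columns]

    model = xgb.XGBRegressor(n_estimators=500, learning_rate=0.05)
    model.fit(X_train, y_train)

    # SHAP-based selection runs on the clean feature set only.
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X_train)
    mean_abs_shap = abs(shap_values).mean(axis=0)
    keep = X_train.columns[mean_abs_shap > 0.001]   # illustrative threshold
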
One other issue was that all of the modeling code lived in the main() function, which had grown far too long and hard to read. That one function contained the train/test splitting, the pruning, the model fitting, AND all of the scoring. I broke the train/validate/test process out on its own, and put that process into play for both the full model and the SHAP-pruned model. Then I took the best R-squared from the two runs and used the winning model, as sketched below.
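
The refactor boils down to one reusable train-and-validate helper plus a comparison step, roughly like this (function and variable names are mine, not the actual code):

    import xgboost as xgb
    from sklearn.metrics import r2_score

    def train_and_validate(X_train, y_train, X_val, y_val, feature_list):
        # Fit on train, score on validation; return the model and its validation R-squared.
        model = xgb.XGBRegressor(n_estimators=500, learning_rate=0.05)
        model.fit(X_train[feature_list], y_train)
        return model, r2_score(y_val, model.predict(X_val[feature_list]))

    # all_features vs shap_features: the full list vs the SHAP-selected subset (e.g. `keep` above).
    full_model, full_r2 = train_and_validate(X_train, y_train, X_val, y_val, all_features)
    shap_model, shap_r2 = train_and_validate(X_train, y_train, X_val, y_val, shap_features)

    # Keep whichever generalizes better, then score it once on the held-out test set.
    best_model, best_feats = (full_model, all_features) if full_r2 >= shap_r2 else (shap_model, shap_features)
    test_r2 = r2_score(y_test, best_model.predict(X_test[best_feats]))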

Wednesday, July 2, 2025

AI / ML - Altman-Z Score

I saw a website that was showing an Altman-Z score for companies in their Solvency section.

Not fully aware of this calculation, I decided to jump in and add it to my model.

I quickly backed it out.

Why?

Altman-Z uses different calculations based on the industry a company is in. 

Manufacturers use one calculation, other companies use another, and banks and finance companies don't calculate it at all. 

So imagine calculating this and feeding it into XGBoost / SHAP to predict price or return on a security. 

First of all, because so many companies have no score at all, you have a missingness issue (lots of NaNs). Then, the values that do exist differ because of the different calculation methods. And if you don't cap the score, you can get outliers that wreak havoc.

So in the end, it's fine to calculate it, but if you calculate it, don't model it as a predictive feature. 

Just calculate it and "tack it on" (staple it) to any sector-specific scores you are generating, for things like rank-within-sector reporting.
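
In practice that "tack it on" step can be as simple as the sketch below (column names and caps are illustrative): cap the extremes, carry the score in the sector report, and keep it out of the modeling feature list.

    # Cap the extremes so the handful of valid scores can't wreak havoc on a report.
    report_df["altman_z_capped"] = report_df["altman_z"].clip(lower=-5, upper=15)

    # It rides along in the sector report for ranking/display...
    sector_report = report_df[["symbol", "Sector", "sector_score", "altman_z_capped"]]

    # ...but stays out of the modeling feature set.
    model_features = [c for c in report_df.columns
                      if c not in {"altman_z", "altman_z_capped"}]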

AI / ML Feature Explosion and Normalization

I got into some trouble with my model when it blew up to 150-180 features. When I finally took the time to really scrutinize things, I noticed that I had duplicates of raw metrics sitting alongside their sector-z-scored counterparts. I don't know how that happened, or how those sector-z-scored columns got into the feature set submitted to XGBoost and SHAP - probably logic inserted in the wrong place.

I wound up removing all of the sector-z-scored metrics for now.

But this highlighted a problem. Semantics.

I had some metrics - mostly raw dollar amounts - that needed to be normalized for scoring and comparison purposes. To do this, we divided each value by TotalAssets. For metrics that we did NOT want to treat this way (already-normalized ratios, per-share figures, and so on), we had exclusion logic based on regular expressions (regex): we looked for metric names containing "Per" and "To" (among others).
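
The exclusion logic is roughly this (a simplified version of the actual pattern):

    import re

    # Match camelCase tokens like "EarningsPerShare" or "DebtToEquity",
    # but not "TotalAssets" (the "To" must be followed by an uppercase letter).
    EXCLUDE_PATTERN = re.compile(r"(Per|To)(?=[A-Z])")

    scale_cols = [col for col in df.columns
                  if col not in ("TotalAssets", "fwdreturn")
                  and not EXCLUDE_PATTERN.search(col)]

    for col in scale_cols:
        df[f"{col}_to_assets"] = df[col] / df["TotalAssets"]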

This seems to have fixed our feature set, and it is so much better to see 30 of 80 features selected instead of 100 of 180. It removed a ton of noise from the model and improved its integrity.

Now I do need to go back and examine why we did the sector z-scores initially, to see if that is something we do need to engineer back in. I think we need to do that in the cases where we are producing a Top-X-By-Sector report. 

Friday, June 13, 2025

AI/ML Feature Engineering - Adding Feature-Based Features

I added some new features (metrics) to my model. The Quarterly model.

To recap, I have downloaded quarterly statements for stock symbols, and I use these to calculate an absolute slew of metrics and ratios. Then I feed them into the XGBoost regression model, to figure out whether they can predict a forward return of stock price.

I added some macroeconomic indicators, because I felt that those might impact the quarterly (short-term) price of a stock more than the pure fundamentals of the stock do.

The fundamentals are used in an annual model - a separate model - and that model is not distracted or interrupted by "events" or macroeconomics that get in the way of understanding the true health of a company, based on its fundamentals, over a years-long period of time.

So - what did I add to the quarterly model?

  • Consumer Sentiment
  • Business Confidence
  • Inflation Expectations
  • Treasury Data (1-, 3-, and 10-year)
  • Unemployment 

And wow - did these variables kick in. At one point, I had the model's R-squared up to 0.16.

Unemployment did nothing, actually. And I wound up removing it as a noise factor. I also realized I had the fiscal quarter included, and removed that too since it, like sector and other descriptive variables, should not be in the model.

But - as I was about to put a wrap on it, I decided to do one more "push" to improve the R-squared value, and started fiddling around. I got cute, adding derived features. One of the things I did was to add lag features for business confidence, consumer sentiment, and inflation expectations. Interestingly, two of these shot to the top of the influential metrics.

Feature importance list, sorted by weight (influence on return price):

feature                      weight
business_confidence_lag1     0.059845
inflation_lag1               0.054764

But others were a bust, with 0.00000 weights.

I tried removing the original metrics and JUST keeping the lags - didn't really help.
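
For reference, the lag features themselves are just prior-quarter copies of the macro series, created per symbol after sorting by date - something along these lines (column names illustrative):

    macro_cols = ["business_confidence", "consumer_sentiment", "inflation_expectations"]

    # Prior-quarter value of each macro series, per symbol.
    df = df.sort_values(["symbol", "quarter_end"])
    for col in macro_cols:
        df[f"{col}_lag1"] = df.groupby("symbol")[col].shift(1)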

Another thing worth noting is that I added SHAP values - a topic I will get into in more depth shortly, perhaps in a subsequent post. SHAP (SHapley Additive exPlanations) is a method used to explain the output of machine learning models by assigning each feature an importance value for a specific prediction, so that models - like so many - are not a complete "black box".

But one thing I noticed when I added the SHAP feature list is that it does NOT match or line up with the feature importances that the XGBoost model reports.

So I definitely need to look into this.
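
Part of the answer may simply be that the two rankings measure different things: XGBoost's built-in feature_importances_ is split-count or gain based (depending on importance_type), while SHAP ranks by mean absolute contribution to the prediction, so the orderings are not expected to match exactly. A quick side-by-side, assuming a fitted model and the train/validation frames already in hand:

    import numpy as np
    import pandas as pd
    import shap

    # XGBoost's built-in ranking (weight/gain/cover, depending on importance_type).
    builtin = pd.Series(model.feature_importances_, index=X_train.columns)

    # SHAP's ranking: mean absolute contribution to the prediction.
    explainer = shap.TreeExplainer(model)
    shap_vals = explainer.shap_values(X_val)
    shap_rank = pd.Series(np.abs(shap_vals).mean(axis=0), index=X_val.columns)

    comparison = pd.DataFrame({"xgb_importance": builtin, "mean_abs_shap": shap_rank})
    print(comparison.sort_values("mean_abs_shap", ascending=False).head(15))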

I Need More Financial Quant Data - Techniques On How To Get It

I may have posted earlier about how finding enough data - for free - is extreeeemely difficult. Even if you can find it, ensuring the integr...