Friday, August 1, 2025

AI / ML - Modeling Fundamentals - Mistakes Found and Corrected

After adding Earnings Surprise Score data into my 1-year forward return prediction model, I kind of felt as though I had hit the end of the road with it. The Earnings Surprise Score did move the needle. But with all of the effort I had put into feature engineering on this model, the only thing I really felt I could still add was sentiment (news). Given that news is more of a real-time concept, grows stale, and would be relevant only for the latest row of data, I decided to do some final reviews and move on, or "graduate" to some new things - like maybe trying out a neural network or doing more current or real-time analysis. In fact, I had already tried a quarterly model, but its R-squared was terrible and I decided not to use it - not even to ensemble it with the annual report data model.

So - I asked a few different LLMs to code-review my model. And I was horrified to learn that, because I had been using LLMs to continually tweak the model, I had wound up with issues related to "Snippet Integration".

Specifically, I had some major issues:

1. Train/Test Split Happened Too Early or Multiple Times

  •  Splitting data before full preprocessing (e.g., before feature scaling, imputation, or log-transforming the target).
  •  Redundant train/test splits defined in multiple places — some commented, some active — leading to potential inconsistencies depending on which was used.


2. No Validation Set

  •  Originally, data was split into training and test sets.
    •  This meant that model tuning (e.g. SHAP threshold, hyperparameter selection) was inadvertently leaking test set information. 

  •  Now corrected with a clean train/val/test split.
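
A rough sketch of the corrected split, assuming a pandas DataFrame df with the fwdreturn target (function and variable names here are illustrative, not the exact ones in my script). This covers the first two issues at once: the split is now defined in exactly one place, and it produces train, validation, and test sets - SHAP thresholds and hyperparameters get tuned on validation, and the test set is scored once at the very end.

    import pandas as pd
    from sklearn.model_selection import train_test_split

    def split_data(df, target="fwdreturn", val_size=0.15, test_size=0.15, seed=42):
        """Single, canonical train/val/test split - defined once, used everywhere."""
        X = df.drop(columns=[target])
        y = df[target]

        # Carve off the test set first; it is never touched during tuning.
        X_tmp, X_test, y_tmp, y_test = train_test_split(
            X, y, test_size=test_size, random_state=seed)

        # Split the remainder into train and validation.
        val_frac = val_size / (1.0 - test_size)
        X_train, X_val, y_train, y_val = train_test_split(
            X_tmp, y_tmp, test_size=val_frac, random_state=seed)

        return X_train, X_val, X_test, y_train, y_val, y_test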


3. Inconsistent Preprocessing Between Train and Test

  •  Preprocessing steps like imputation, outlier clipping, or scaling were not always applied after the split.
  •  This risked information bleeding from test → train, violating standard ML practice.
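
The fix here is mostly ordering: fit the imputer and scaler on the training rows only, and merely transform validation and test. A minimal sketch (the choice of median imputation and standard scaling is illustrative):

    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler

    def preprocess(X_train, X_val, X_test):
        """Fit imputation/scaling on train only; val and test are only transformed."""
        imputer = SimpleImputer(strategy="median")
        scaler = StandardScaler()

        # fit_transform on train, plain transform everywhere else -
        # the same rule applies to any outlier-clipping bounds.
        X_train_p = scaler.fit_transform(imputer.fit_transform(X_train))
        X_val_p = scaler.transform(imputer.transform(X_val))
        X_test_p = scaler.transform(imputer.transform(X_test))
        return X_train_p, X_val_p, X_test_p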


4. Improper Handling of Invalid Target Values (fwdreturn)

  •  NaN, inf, and unrealistic values (like ≤ -100%) were not being consistently filtered.
  •  This led to silent corruption of both training and evaluation scores.
  •  Now fixed with a strict invalid_mask and logging/reporting of dropped rows.
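
The target cleanup now looks roughly like this (a sketch that assumes df is the modeling DataFrame and that returns are stored as decimals, so -1.0 means -100%):

    import numpy as np

    # Flag rows whose forward return is unusable: NaN, +/-inf, or <= -100%.
    invalid_mask = (
        df["fwdreturn"].isna()
        | np.isinf(df["fwdreturn"])
        | (df["fwdreturn"] <= -1.0)
    )

    n_dropped = int(invalid_mask.sum())
    print(f"Dropping {n_dropped} rows with invalid fwdreturn "
          f"({n_dropped / len(df):.1%} of the dataset)")

    df = df.loc[~invalid_mask].copy()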


5. Redundant or Conflicting Feature Definitions

  •  There were multiple, overlapping blocks like:

                features = [col for col in df.columns if ...]
                X = df[features]

  •  This made it unclear which feature list was actually being used.
  •  Sector z-scored and raw versions were sometimes duplicated or mixed without clarity.
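
The fix was to build the feature list in exactly one place. Something along these lines, where the excluded identifier columns are illustrative:

    def build_feature_list(df, target="fwdreturn", id_cols=("ticker", "sector", "date")):
        """Single source of truth for which columns feed the model."""
        exclude = set(id_cols) | {target}
        return [col for col in df.columns
                if col not in exclude and df[col].dtype.kind in "if"]

    features = build_feature_list(df)
    X = df[features]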


6. Scaling and Z-Scoring Logic Was Not Modular or Controlled

  •  Originally, some features were being z-scored after asset-scaling (which didn’t make sense).
  •  Some metrics were scaled both to assets and z-scored sector-wise, which polluted the modeling signal.
  •  Now addressed with clear separation and feature naming conventions.
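
A sketch of the sector z-scoring under the new convention, assuming a sector column and that z-scores are computed from the raw metric rather than from an already asset-scaled version; the _sector_z suffix makes it obvious which transform a column carries:

    import numpy as np

    def add_sector_zscores(df, cols, sector_col="sector"):
        """Z-score selected raw metrics within each sector, stored under a _sector_z suffix."""
        grouped = df.groupby(sector_col)
        for col in cols:
            mean = grouped[col].transform("mean")
            std = grouped[col].transform("std")
            df[f"{col}_sector_z"] = (df[col] - mean) / std.replace(0, np.nan)
        return df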


7. SHAP Was Applied to a Noisy or Unclean Feature Space

  •  Without proper pruning first (e.g., dropping all-NaN columns), SHAP feature importance included irrelevant or broken columns.
    •  This could inflate feature count or misguide model interpretation.
  •  Now resolved by cleaning feature set before SHAP and applying SHAP-based selection only on valid, imputed data.
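
A sketch of the corrected SHAP step, run only after dropping all-NaN columns and imputing; the model and the importance threshold here are illustrative stand-ins, not the exact ones in my script:

    import numpy as np
    import shap
    from xgboost import XGBRegressor

    def shap_prune(X_train, y_train, threshold=0.01):
        """Select features by mean |SHAP| value, computed on a cleaned training set."""
        # Drop columns that are entirely NaN, then impute the rest with medians.
        X_clean = X_train.dropna(axis=1, how="all")
        X_clean = X_clean.fillna(X_clean.median())

        model = XGBRegressor(n_estimators=300, max_depth=4, random_state=42)
        model.fit(X_clean, y_train)

        explainer = shap.TreeExplainer(model)
        shap_values = explainer.shap_values(X_clean)
        importance = np.abs(shap_values).mean(axis=0)

        keep = importance >= threshold * importance.max()
        return list(X_clean.columns[keep])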

One more issue was that all of the modeling code lived in the main() function, which had gotten way too long and hard to read. That one function held the training, the testing, the splitting/pruning, the model fitting, AND all of the scoring. I broke the train/validate/test process out on its own and ran it for both the full model and the SHAP-pruned model. Then I took the best R-squared from the two runs and used the winning model.
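
Roughly, the refactor means main() now just orchestrates, and one helper runs the same fit/validate path for any feature set, so the full and SHAP-pruned models go through identical code. A hedged sketch that reuses the helpers from the earlier snippets (again, the names are mine):

    from sklearn.metrics import r2_score
    from xgboost import XGBRegressor

    def run_experiment(model, X_train, y_train, X_val, y_val, features):
        """Fit on train, score on validation; return the fitted model and its R-squared."""
        model.fit(X_train[features], y_train)
        return model, r2_score(y_val, model.predict(X_val[features]))

    def main(df):
        # preprocess() from the earlier sketch is omitted here for brevity.
        X_train, X_val, X_test, y_train, y_val, y_test = split_data(df)

        full_feats = build_feature_list(df)
        pruned_feats = shap_prune(X_train[full_feats], y_train)

        full_model, full_r2 = run_experiment(
            XGBRegressor(), X_train, y_train, X_val, y_val, full_feats)
        pruned_model, pruned_r2 = run_experiment(
            XGBRegressor(), X_train, y_train, X_val, y_val, pruned_feats)

        # Keep whichever feature set validates better, then score it once on the test set.
        winner, feats = ((full_model, full_feats) if full_r2 >= pruned_r2
                         else (pruned_model, pruned_feats))
        print("Test R-squared:", r2_score(y_test, winner.predict(X_test[feats])))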


AI / ML - Data Source Comparison with More Data

" To validate generalizability, I ran the same model architecture against two datasets: a limited Yahoo-based dataset and a deeper Stoc...