Friday, August 1, 2025

AI / ML - Modeling Fundamentals - Mistakes Found and Corrected

After adding Earnings Surprise Score data into my 1-year forward return prediction model, I kind of felt as though I had hit the end of the road with it. The Earnings Surprise Score did move the needle. But with all of the effort I had put into feature engineering on this model, the only thing I really felt I could still add was sentiment (news). Given that news is more of a real-time concept, grows stale, and would be relevant only for the latest row of data, I decided to do some final reviews and move on, or "graduate" to some new things - like maybe trying out a neural network or doing more current or real-time analysis. In fact, I had already tried a quarterly model, but its R-squared was terrible and I decided not to use it - not even to ensemble it with the annual report data model.

So - I asked a few different LLMs to code-review my model. And I was horrified to learn that, because I had been using LLMs to continually tweak the model, I had wound up with issues related to "Snippet Integration".

Specifically, I had some major issues:

1. Train/Test Split Happened Too Early or Multiple Times

  •  Splitting data before full preprocessing (e.g., before feature scaling, imputation, or log-transforming the target).
  •  Redundant train/test splits defined in multiple places — some commented, some active — leading to potential inconsistencies depending on which was used.


2. No Validation Set

  •  Originally, data was split into training and test sets.
    •  This meant that model tuning (e.g. SHAP threshold, hyperparameter selection) was inadvertently leaking test set information. 

  •  Now corrected with a clean train/val/test split.
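
A rough sketch of the corrected split, assuming a pandas DataFrame df with the fwdreturn target (function and variable names here are illustrative, not the exact ones in my script). This covers the first two issues at once: the split is now defined in exactly one place, and it produces train, validation, and test sets - SHAP thresholds and hyperparameters get tuned on validation, and the test set is scored once at the very end.

    import pandas as pd
    from sklearn.model_selection import train_test_split

    def split_data(df, target="fwdreturn", val_size=0.15, test_size=0.15, seed=42):
        """Single, canonical train/val/test split - defined once, used everywhere."""
        X = df.drop(columns=[target])
        y = df[target]

        # Carve off the test set first; it is never touched during tuning.
        X_tmp, X_test, y_tmp, y_test = train_test_split(
            X, y, test_size=test_size, random_state=seed)

        # Split the remainder into train and validation.
        val_frac = val_size / (1.0 - test_size)
        X_train, X_val, y_train, y_val = train_test_split(
            X_tmp, y_tmp, test_size=val_frac, random_state=seed)

        return X_train, X_val, X_test, y_train, y_val, y_test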


3. Inconsistent Preprocessing Between Train and Test

  •  Preprocessing steps like imputation, outlier clipping, or scaling were not always applied after the split.
  •  This risked information bleeding from test → train, violating standard ML practice.
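
The fix here is mostly ordering: fit the imputer and scaler on the training rows only, and merely transform validation and test. A minimal sketch (the choice of median imputation and standard scaling is illustrative):

    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler

    def preprocess(X_train, X_val, X_test):
        """Fit imputation/scaling on train only; val and test are only transformed."""
        imputer = SimpleImputer(strategy="median")
        scaler = StandardScaler()

        # fit_transform on train, plain transform everywhere else -
        # the same rule applies to any outlier-clipping bounds.
        X_train_p = scaler.fit_transform(imputer.fit_transform(X_train))
        X_val_p = scaler.transform(imputer.transform(X_val))
        X_test_p = scaler.transform(imputer.transform(X_test))
        return X_train_p, X_val_p, X_test_p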


4. Improper Handling of Invalid Target Values (fwdreturn)

  •  NaN, inf, and unrealistic values (like ≤ -100%) were not being consistently filtered.
  •  This led to silent corruption of both training and evaluation scores.
  •  Now fixed with a strict invalid_mask and logging/reporting of dropped rows.
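
The target cleanup now looks roughly like this (a sketch that assumes df is the modeling DataFrame and that returns are stored as decimals, so -1.0 means -100%):

    import numpy as np

    # Flag rows whose forward return is unusable: NaN, +/-inf, or <= -100%.
    invalid_mask = (
        df["fwdreturn"].isna()
        | np.isinf(df["fwdreturn"])
        | (df["fwdreturn"] <= -1.0)
    )

    n_dropped = int(invalid_mask.sum())
    print(f"Dropping {n_dropped} rows with invalid fwdreturn "
          f"({n_dropped / len(df):.1%} of the dataset)")

    df = df.loc[~invalid_mask].copy()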


5. Redundant or Conflicting Feature Definitions

  •  There were multiple, overlapping blocks like:

                features = [col for col in df.columns if ...]
                X = df[features]

  •  This made it unclear which feature list was actually being used.
  •  Sector z-scored and raw versions were sometimes duplicated or mixed without clarity.
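
The fix was to build the feature list in exactly one place. Something along these lines, where the excluded identifier columns are illustrative:

    def build_feature_list(df, target="fwdreturn", id_cols=("ticker", "sector", "date")):
        """Single source of truth for which columns feed the model."""
        exclude = set(id_cols) | {target}
        return [col for col in df.columns
                if col not in exclude and df[col].dtype.kind in "if"]

    features = build_feature_list(df)
    X = df[features]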


6. Scaling and Z-Scoring Logic Was Not Modular or Controlled

  •  Originally, some features were being z-scored after asset-scaling (which didn’t make sense).
  •  Some metrics were scaled both to assets and z-scored sector-wise, which polluted the modeling signal.
  •  Now addressed with clear separation and feature naming conventions.
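
A sketch of the sector z-scoring under the new convention, assuming a sector column and that z-scores are computed from the raw metric rather than from an already asset-scaled version; the _sector_z suffix makes it obvious which transform a column carries:

    import numpy as np

    def add_sector_zscores(df, cols, sector_col="sector"):
        """Z-score selected raw metrics within each sector, stored under a _sector_z suffix."""
        grouped = df.groupby(sector_col)
        for col in cols:
            mean = grouped[col].transform("mean")
            std = grouped[col].transform("std")
            df[f"{col}_sector_z"] = (df[col] - mean) / std.replace(0, np.nan)
        return df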


7. SHAP Was Applied to a Noisy or Unclean Feature Space

  •  Without proper pruning first (e.g., dropping all-NaN columns), SHAP feature importance included irrelevant or broken columns.
    •  This could inflate feature count or misguide model interpretation.
  •  Now resolved by cleaning feature set before SHAP and applying SHAP-based selection only on valid, imputed data.
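
A sketch of the corrected SHAP step, run only after dropping all-NaN columns and imputing; the model and the importance threshold here are illustrative stand-ins, not the exact ones in my script:

    import numpy as np
    import shap
    from xgboost import XGBRegressor

    def shap_prune(X_train, y_train, threshold=0.01):
        """Select features by mean |SHAP| value, computed on a cleaned training set."""
        # Drop columns that are entirely NaN, then impute the rest with medians.
        X_clean = X_train.dropna(axis=1, how="all")
        X_clean = X_clean.fillna(X_clean.median())

        model = XGBRegressor(n_estimators=300, max_depth=4, random_state=42)
        model.fit(X_clean, y_train)

        explainer = shap.TreeExplainer(model)
        shap_values = explainer.shap_values(X_clean)
        importance = np.abs(shap_values).mean(axis=0)

        keep = importance >= threshold * importance.max()
        return list(X_clean.columns[keep])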

One more issue was that all of the modeling code lived in the main() function, which had gotten way too long and hard to read. That one function held the training, the testing, the splitting/pruning, the model fitting, AND all of the scoring. I broke the train/validate/test process out on its own and ran it for both the full model and the SHAP-pruned model. Then I took the best R-squared from the two runs and used the winning model.
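
Roughly, the refactor means main() now just orchestrates, and one helper runs the same fit/validate path for any feature set, so the full and SHAP-pruned models go through identical code. A hedged sketch that reuses the helpers from the earlier snippets (again, the names are mine):

    from sklearn.metrics import r2_score
    from xgboost import XGBRegressor

    def run_experiment(model, X_train, y_train, X_val, y_val, features):
        """Fit on train, score on validation; return the fitted model and its R-squared."""
        model.fit(X_train[features], y_train)
        return model, r2_score(y_val, model.predict(X_val[features]))

    def main(df):
        # preprocess() from the earlier sketch is omitted here for brevity.
        X_train, X_val, X_test, y_train, y_val, y_test = split_data(df)

        full_feats = build_feature_list(df)
        pruned_feats = shap_prune(X_train[full_feats], y_train)

        full_model, full_r2 = run_experiment(
            XGBRegressor(), X_train, y_train, X_val, y_val, full_feats)
        pruned_model, pruned_r2 = run_experiment(
            XGBRegressor(), X_train, y_train, X_val, y_val, pruned_feats)

        # Keep whichever feature set validates better, then score it once on the test set.
        winner, feats = ((full_model, full_feats) if full_r2 >= pruned_r2
                         else (pruned_model, pruned_feats))
        print("Test R-squared:", r2_score(y_test, winner.predict(X_test[feats])))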


AI / ML - Data Source Comparison with More Data

" To validate generalizability, I ran the same model architecture against two datasets: a limited Yahoo-based dataset and a deeper Stoc...