After adding Earnings Surprise Score data to my 1-year forward return prediction model, I felt I had pretty much reached the end of the road with it. The Earnings Surprise Score did move the needle. But after all of the feature engineering effort I had put into the model, the only thing I really felt I could still add was news sentiment. Given that news is more of a real-time signal, grows stale quickly, and would be relevant for only the latest row of data, I decided to do some final reviews and then move on, or "graduate," to some new things - maybe trying out a neural network, or doing more current or real-time analysis. In fact, I had already tried a quarterly model, but its R-squared was terrible and I decided not to use it - not even to ensemble it with the annual report data model.
So I asked a few different LLMs to code-review my model. And I was horrified to learn that, because I had been using LLMs to continually tweak the model, I had wound up with issues related to "Snippet Integration" - snippets of generated code bolted on over time that no longer fit together cleanly.
Specifically, I had some major issues:
1. Train/Test Split Happened Too Early or Multiple Times
- The data was split before preprocessing was finished (e.g., before feature scaling, imputation, or log-transforming the target).
- Redundant train/test splits were defined in multiple places (some commented out, some active), leading to potential inconsistencies depending on which one actually ran.
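The cleanup here is conceptually simple: define the split exactly once, in one place, after the target has been cleaned. A minimal sketch of the idea (the helper name, the 80/20 ratio, and the seed are illustrative, not necessarily what my model uses):

    from sklearn.model_selection import train_test_split

    def split_once(df, target="fwdreturn", seed=42):
        # Single authoritative split, run once after the target has been cleaned.
        # No commented-out alternative splits lying around elsewhere in the script.
        X = df.drop(columns=[target])
        y = df[target]
        return train_test_split(X, y, test_size=0.2, random_state=seed)

    X_train, X_test, y_train, y_test = split_once(df)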
2. No Validation Set
- Originally, the data was split only into training and test sets.
- This meant that model tuning (e.g., the SHAP threshold and hyperparameter selection) was inadvertently leaking test set information.
- Now corrected with a clean train/validation/test split.
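A rough sketch of what the corrected three-way split looks like (the 60/20/20 proportions and the seed are assumptions to illustrate the pattern, not necessarily the ratios I settled on):

    from sklearn.model_selection import train_test_split

    # Hold out the test set first, then split the remainder into train/val.
    # Tuning (SHAP threshold, hyperparameters) only ever sees the validation
    # set; the test set is scored exactly once at the end.
    X_trainval, X_test, y_trainval, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    X_train, X_val, y_train, y_val = train_test_split(
        X_trainval, y_trainval, test_size=0.25, random_state=42  # 0.25 of 80% = 20%
    )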
3. Inconsistent Preprocessing Between Train and Test
- Preprocessing steps like imputation, outlier clipping, and scaling were not always fit on the training data after the split; sometimes they were applied to the full dataset first.
- This risked information bleeding from test → train, violating standard ML practice.
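The standard pattern that avoids this is to fit every transform on the training rows only and then apply the already-fitted transform to validation and test. A sketch of that idea - the median imputation and 1st/99th-percentile clipping are illustrative choices, not necessarily the exact steps in my pipeline:

    import numpy as np
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler

    # Fit on train only; transform val/test with the already-fitted objects.
    imputer = SimpleImputer(strategy="median")
    scaler = StandardScaler()

    X_train_p = scaler.fit_transform(imputer.fit_transform(X_train))
    X_val_p = scaler.transform(imputer.transform(X_val))
    X_test_p = scaler.transform(imputer.transform(X_test))

    # Outlier clipping bounds also come from the training data only.
    lo, hi = np.percentile(X_train_p, [1, 99], axis=0)
    X_train_p = np.clip(X_train_p, lo, hi)
    X_val_p = np.clip(X_val_p, lo, hi)
    X_test_p = np.clip(X_test_p, lo, hi)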
4. Improper Handling of Invalid Target Values (fwdreturn)
- NaN, inf, and unrealistic values (like returns of -100% or worse) were not being consistently filtered out.
- This silently corrupted both training and the evaluation scores.
- Now fixed with a strict invalid_mask and logging/reporting of the dropped rows.
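The invalid_mask itself is straightforward. A sketch, assuming fwdreturn is stored as a decimal return (so -1.0 means a -100% loss; adjust the threshold if it is in percent):

    import numpy as np

    # Flag targets that are NaN, infinite, or impossible (a loss of 100% or
    # worse), report how many rows get dropped, then drop them.
    fwd = df["fwdreturn"]
    invalid_mask = fwd.isna() | ~np.isfinite(fwd) | (fwd <= -1.0)
    print(f"Dropping {invalid_mask.sum()} rows with invalid fwdreturn")
    df = df.loc[~invalid_mask].copy()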
5. Redundant or Conflicting Feature Definitions
- There were multiple, overlapping blocks like:
    features = [col for col in df.columns if ...]
    X = df[features]
- This made it unclear which feature list was actually being used.
- Sector z-scored and raw versions of the same feature were sometimes duplicated or mixed together with no clear distinction.
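The cleanup is to build the feature list once, from explicit exclusions, and reuse that single list everywhere downstream. A sketch - the excluded identifier columns (ticker, sector) are placeholders for whatever non-feature columns the frame actually has:

    # One canonical feature list, defined in one place.
    EXCLUDE = {"ticker", "sector", "fwdreturn"}   # identifiers and the target
    FEATURES = sorted(c for c in df.columns if c not in EXCLUDE)

    X = df[FEATURES]
    y = df["fwdreturn"]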
6. Scaling and Z-Scoring Logic Was Not Modular or Controlled
- Originally, some features were z-scored after already being scaled to assets, which didn't make sense.
- Some metrics were both scaled to assets and z-scored sector-wise, which polluted the modeling signal.
- Now addressed with a clear separation of the two transforms and explicit feature naming conventions.
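A naming convention does most of the work here. A sketch with made-up column names (capex, total_assets) just to show the pattern of keeping the asset-scaled value and its sector z-score as separate, clearly labeled columns:

    # The asset-scaled version keeps a *_to_assets suffix...
    df["capex_to_assets"] = df["capex"] / df["total_assets"]

    # ...and the sector z-score gets its own *_sector_z column, computed from
    # the asset-scaled values but never overwriting or re-scaling them.
    df["capex_to_assets_sector_z"] = (
        df.groupby("sector")["capex_to_assets"]
          .transform(lambda s: (s - s.mean()) / s.std())
    )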
7. SHAP Was Applied to a Noisy or Unclean Feature Space
- Without proper pruning first (e.g., dropping all-NaN columns), SHAP feature importances included irrelevant or broken columns.
- This could inflate the feature count or mislead model interpretation.
- Now resolved by cleaning the feature set before running SHAP and applying SHAP-based selection only to valid, imputed data.
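Roughly, the cleanup before SHAP now looks like this - a sketch that assumes an XGBoost regressor and an arbitrary importance cutoff, neither of which is meant as the definitive setup:

    import shap
    import xgboost as xgb

    # Prune broken columns first: drop anything all-NaN or constant, then impute.
    valid_cols = [c for c in X_train.columns
                  if X_train[c].notna().any() and X_train[c].nunique() > 1]
    X_shap = X_train[valid_cols].fillna(X_train[valid_cols].median())

    model = xgb.XGBRegressor(n_estimators=300, random_state=42).fit(X_shap, y_train)
    shap_values = shap.TreeExplainer(model).shap_values(X_shap)

    # Rank features by mean |SHAP| and keep the ones above a chosen cutoff
    # (a cutoff tuned on the validation set, never the test set).
    mean_abs = abs(shap_values).mean(axis=0)
    keep = [c for c, v in zip(valid_cols, mean_abs) if v >= 0.01]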