Wednesday, July 9, 2025

AI / ML - Feature Engineering - Interaction Features

I added some new macro features to my model - credit card debt, credit card delinquency, and unemployment data.

Some of these were VERY influential features.

So we can see that unemployment_rate is an important feature! It tops the list!!!

But - since we are doing relative scoring on stocks, what good does that do us, if every single stock sees the same macro values???

The answer: Interaction Features. 

Since unemployment can impact revenue growth (fewer consumers can afford to buy), you multiply the Revenue Growth Year-Over-Year percentage by the unemployment rate. Now, you get a UNIQUE value that works for that specific stock symbol instead of just throwing "across the board" metrics at every stock.
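A minimal sketch of that multiplication in pandas. The symbols, the rev_growth_yoy column name, and all the numbers are made up for illustration; only unemployment_rate and unemployment_x_rev_grw are names that actually appear in the model.

```python
import pandas as pd

# Toy data (hypothetical): three made-up symbols observed on the same date.
# Every row carries the same macro value, which is the problem described above.
df = pd.DataFrame({
    "symbol": ["AAA", "BBB", "CCC"],
    "rev_growth_yoy": [0.12, -0.05, 0.30],   # stock-specific, assumed column name
    "unemployment_rate": [4.1, 4.1, 4.1],    # macro, identical for every stock
})

# Interaction feature: macro value times the stock-specific value.
df["unemployment_x_rev_grw"] = df["unemployment_rate"] * df["rev_growth_yoy"]

# The raw macro column cannot separate stocks on a given date; the interaction can.
macro_unique = df["unemployment_rate"].nunique()
interaction_unique = df["unemployment_x_rev_grw"].nunique()

print(df)
print("distinct macro values:", macro_unique)
print("distinct interaction values:", interaction_unique)
```

Within a single date the raw macro column is a constant, so it cannot rank one stock against another. The interaction column takes a different value for each symbol.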

Now, if you don't do this, the macro variables in and of themselves CAN still impact a model, especially if a stock's forward return is sensitive to that feature. That is what XGBoost gives you: its trees can split on the macro value and then on stock-specific features, so they pick up some of the interaction implicitly. But you help the model by giving each stock its own uniquely calculated value, as opposed to giving every stock the same value of "X.Y".

I did this, and got my latest high score on R-Squared.
Selected 30 features out of 97 (threshold = 0.007004095707088709)
⭐ New best model saved with R²: 0.4001

Pruned XGBoost Model R² score: 0.4001
Pruned XGBoost Model RMSE: 0.3865
Pruned XGBoost Model MAE: 0.2627

Full XGBoost R² score: 0.3831
Full XGBoost RMSE 0.3919
Full XGBoost MAE: 0.2694

Update: Now, it should be noted that both the raw macro metrics and the interaction features are thrown into the model - and we let the process (i.e. SHAP) decide which one - the raw metric or the interaction feature - to keep in the subsequent pruned model.
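A rough sketch of what that kind of pruning step can look like, assuming features are ranked by mean absolute SHAP value and kept if they clear a threshold. That ranking rule is an assumption; the exact selection logic behind the log above is not shown here. The SHAP matrix, the threshold, and the names rev_growth_yoy and noise_feature are made up for illustration.

```python
import numpy as np

# Hypothetical feature names: raw macro, interaction, stock-specific, and a weak one.
feature_names = [
    "unemployment_rate",
    "unemployment_x_rev_grw",
    "rev_growth_yoy",
    "noise_feature",
]

# Made-up SHAP values, shape (n_samples, n_features). In a real run these
# would come from a SHAP explainer applied to the trained full model.
shap_values = np.array([
    [0.30, -0.10, 0.05, 0.001],
    [-0.20, 0.15, -0.03, -0.002],
    [0.25, -0.05, 0.04, 0.000],
])

# Global importance per feature: mean of the absolute SHAP values.
importance = np.abs(shap_values).mean(axis=0)

# Hypothetical cutoff; the real threshold is whatever the pipeline computes.
threshold = 0.007

selected = [
    name for name, imp in zip(feature_names, importance) if imp >= threshold
]

print(f"Selected {len(selected)} features out of {len(feature_names)}")
print(selected)
```

Nothing in this step forces a choice between the raw macro feature and its interaction. Both survive if both clear the threshold.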

In the screenshot above, BOTH of them appear at the top. Unemployment is #1 far and away. And unemployment_x_rev_grw, an interaction feature that combines stock-specific data with the macro data, is also ranked high, but below the raw metric.

I get differing opinions on whether it is good practice to leave them both in the model (raw macro features AND derived interaction features), or to decide on one or the other. It is awfully hard to prune off your top predictor and not use it.

If you have any comments or thoughts on this, please feel free to comment. 


 
