Monday, September 22, 2025

Target Encoding & One-Hot Encoding

I had completely overlooked these concepts early on; they didn't come up until just now - this late in the game - while I was reading some of the advanced sections of the Jansen Algorithmic Trading book.

When you run a model, there are always going to be things you do NOT want to include, for many reasons (circular correlation, redundancy, etc.). Generally, you would not want to include symbol, sector, and industry in your model, because the symbol is, after all, nothing more than an index into the data - and the symbol in and of itself should have NOTHING to do with predicting returns.

So we keep an exclusion array of features we don't want included:
EXCLUDED_FEATURES = [
    # 'symbol', 
    'date', 'year', 'quarter', 'close', 'fwdreturn',  # identity + target
    # 'sector', 'industry',  # optional: include via one-hot or embeddings if desired
...
]
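
Applying the list is then just a filter. A minimal sketch, assuming the data lives in a pandas DataFrame named df (a hypothetical name - the actual pipeline code differs):

# Keep every column that isn't on the exclusion list (sketch; df is assumed).
feature_cols = [c for c in df.columns if c not in EXCLUDED_FEATURES]
X = df[feature_cols]
y = df['fwdreturn']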

Symbol, Sector, and Industry were initially excluded because they are text strings, and models don't like text data (some won't run on anything but numerics, some emit warnings, and others produce unpredictable results if they tolerate it at all).

Then I read about target encodings for symbols - which allow you to KEEP symbols in the data - and about something called "one-hot" encodings for things like sectors and industries.

What these encodings do is take categorical data - like sectors and industries - and convert it to numerics so that it will be considered by the model. Target encoding, for example, replaces each symbol with the average forward return observed for that symbol in the training data, which captures the association between category and target and can sometimes boost model power. One-hot encoding instead creates a separate 0/1 column for each category value.

So - adding the code to do this wasn't terribly painful. The symbol was target-encoded as a single feature (symbol_enc) and removed from the above-mentioned EXCLUDED_FEATURES list. The code to one-hot encode the Sector and Industry was snapped in, and off we went...
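
For anyone following along, here is roughly what that step looks like. This is a hedged sketch, not the actual pipeline code - train, val, and test are assumed pandas DataFrames that still carry the raw text columns:

import pandas as pd

# Target encoding: replace each symbol with the mean forward return it
# showed in the TRAINING split only, so val/test targets never leak in.
symbol_means = train.groupby('symbol')['fwdreturn'].mean()
global_mean = train['fwdreturn'].mean()   # fallback for unseen symbols
for split in (train, val, test):
    split['symbol_enc'] = split['symbol'].map(symbol_means).fillna(global_mean)

# One-hot encoding: one 0/1 dummy column per category value
# (e.g. sector_Energy, sector_Technology, ...).
train = pd.get_dummies(train, columns=['sector', 'industry'])
val = pd.get_dummies(val, columns=['sector', 'industry'])
test = pd.get_dummies(test, columns=['sector', 'industry'])
# Align the splits so they all share the same dummy columns.
val = val.reindex(columns=train.columns, fill_value=0)
test = test.reindex(columns=train.columns, fill_value=0)

The train-only means matter: computing the target encoding over the full dataset would leak the validation and test targets into the features.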

The R-squared dropped from .11 to .06. Yikes...
🪓  Pruning all splits - train, val, test - to the pruned set of features (final_features).
Training pruned model...
FULL R² -- Train: 0.4989, Validation: 0.0776, Test: 0.0673
PRUNED R² -- Train: 0.2216, Validation: 0.0348, Test: 0.0496
Selected FULL model based on test R².
Previous best test R²: 0.1126
Current model test R²: 0.0673
Current model did NOT exceed best. Loading previous best model and features.
Final Model Test Metrics -- R²: 0.1126, RMSE: 0.2133, MAE: 0.1471


I quickly realized that by one-hot encoding the Industry, we created so many new dummy columns - one per industry - that the exploded feature space undoubtedly screwed up the model.

I decided to run again and one-hot encode only the Sector. So now with this run, we have a target encoding for symbol and a one-hot encoding for sector. There are not that many sectors, so this doesn't become too unwieldy - a quick cardinality check (below) bears this out.
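
Since one-hot adds one column per distinct value, checking cardinality up front tells you how bad the blow-up will be (a sketch, again assuming a DataFrame df with the raw text columns):

print(df['sector'].nunique())    # ~11 sectors -> 11 dummy columns, manageable
print(df['industry'].nunique())  # typically 100+ industries -> feature explosion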

But - here is what we got on our SHAP analysis after this change:
🧮 Feature Selection Overview:
✅  consumer_sentiment_×_earnings_yield                SHAP=0.04511
✅  dilutedEPS                                         SHAP=0.02566
✅  rev_grw_4q                                         SHAP=0.01191
✅  evToSales                                          SHAP=0.01181
✅  roe                                                SHAP=0.01071
✅  symbol_enc                                         SHAP=0.01055
✅  business_confidence_×_capexToRevenue               SHAP=0.00925
✅  treasury_spread_x_debt_to_equity                   SHAP=0.00867
✅  cpippi_marginsqueeze                               SHAP=0.00719
✅  earningsYield                                      SHAP=0.00703
✅  inflation_x_debt_to_equity                         SHAP=0.00701
⛔  ppi_x_capextorevenue                               SHAP=0.00628
⛔  operatingCashFlow_to_totalAssets                   SHAP=0.00610
⛔  momentum_1y                                        SHAP=0.00586
⛔  gross_margin                                       SHAP=0.00575
⛔  realized_vol_3m                                    SHAP=0.00543
⛔  long_term_debt_to_equity                           SHAP=0.00526
⛔  rev_grw_qoq                                        SHAP=0.00500
⛔  vix_×_debt_to_equity                               SHAP=0.00480
⛔  goodwill_to_totalAssets                            SHAP=0.00432
⛔  cpi_x_netprofitmargin                              SHAP=0.00371
⛔  incomeQuality                                      SHAP=0.00360
⛔  rev_grw_qoq_to_totalAssets                         SHAP=0.00341
⛔  netDebtToEBITDA                                    SHAP=0.00340
⛔  evToFreeCashFlow                                   SHAP=0.00304
⛔  debt_to_equity                                     SHAP=0.00303
⛔  salesGeneralAndAdministrativeToRevenue             SHAP=0.00288
⛔  vix_x_evtoebitda                                   SHAP=0.00273
⛔  rev_grw_pop_sector_z                               SHAP=0.00268
⛔  eps_grw_qoq                                        SHAP=0.00258
⛔  evToEBITDA                                         SHAP=0.00258
⛔  interestBurden                                     SHAP=0.00255
⛔  momentum_1y_sector_z                               SHAP=0.00252
⛔  evToEBITDA_sector_z                                SHAP=0.00237
⛔  asset_turnover                                     SHAP=0.00235
⛔  totalEquity                                        SHAP=0.00231
⛔  net_margin                                         SHAP=0.00228
⛔  workingCapital                                     SHAP=0.00218
⛔  cc_delinquency_rate_×_debt_ratio                   SHAP=0.00209
⛔  inventory_turnover                                 SHAP=0.00209
⛔  capexToRevenue                                     SHAP=0.00197
⛔  freeCashFlowPerShare                               SHAP=0.00191
⛔  daysOfInventoryOutstanding_sector_z                SHAP=0.00186
⛔  fcf_ps_grw_qoq                                     SHAP=0.00186
⛔  eps_grw_qoq_to_totalAssets                         SHAP=0.00182
⛔  daysOfInventoryOutstanding                         SHAP=0.00180
⛔  daysOfPayablesOutstanding                          SHAP=0.00179
⛔  capexToRevenue_sector_z                            SHAP=0.00179
⛔  operatingCashFlow                                  SHAP=0.00173
⛔  totalEquity_to_totalAssets                         SHAP=0.00163
⛔  stockBasedCompensationToRevenue                    SHAP=0.00159
⛔  cash_ratio                                         SHAP=0.00158
⛔  evToOperatingCashFlow                              SHAP=0.00156
⛔  debt_ratio                                         SHAP=0.00156
⛔  eps_grw_4q                                         SHAP=0.00155
⛔  receivables_turnover                               SHAP=0.00153
⛔  ocf_to_current_liabilities                         SHAP=0.00151
⛔  totalDebt                                          SHAP=0.00146
⛔  operating_margin                                   SHAP=0.00143
⛔  debt_ratio_sector_z                                SHAP=0.00143
⛔  workingCapital_to_totalAssets                      SHAP=0.00135
⛔  operatingReturnOnAssets                            SHAP=0.00130
⛔  daysOfSalesOutstanding                             SHAP=0.00129
⛔  ordinary_shares                                    SHAP=0.00128
⛔  roa                                                SHAP=0.00127
⛔  earnings_surprise                                  SHAP=0.00122
⛔  bookValue                                          SHAP=0.00118
⛔  totalLiabilities                                   SHAP=0.00114
⛔  fcf_ps_grw_qoq_to_totalAssets                      SHAP=0.00113
⛔  goodwill                                           SHAP=0.00112
⛔  fcf_ps_grw_4q                                      SHAP=0.00106
⛔  eps_grw_pop_sector_z                               SHAP=0.00100
⛔  freeCashFlow                                       SHAP=0.00090
⛔  current_ratio                                      SHAP=0.00090
⛔  ebitda_margin                                      SHAP=0.00087
⛔  totalRevenue                                       SHAP=0.00082
⛔  quick_ratio_sector_z                               SHAP=0.00078
⛔  free_cash_flow                                     SHAP=0.00073
⛔  ocf_to_total_liabilities                           SHAP=0.00062
⛔  sector_Energy                                      SHAP=0.00062
⛔  returnOnTangibleAssets                             SHAP=0.00059
⛔  avgDilutedShares                                   SHAP=0.00050
⛔  sector_Basic Materials                             SHAP=0.00034
⛔  sector_Technology                                  SHAP=0.00025
⛔  sector_Real Estate                                 SHAP=0.00025
⛔  sector_Financial Services                          SHAP=0.00019
⛔  sector_Industrials                                 SHAP=0.00011
⛔  sector_Consumer Cyclical                           SHAP=0.00010
⛔  sector_Communication Services                      SHAP=0.00009
⛔  sector_Healthcare                                  SHAP=0.00007
⛔  sector_Utilities                                   SHAP=0.00004
⛔  sector_Consumer Defensive                          SHAP=0.00002
⛔  fcf_ps_grw_pop                                     SHAP=0.00000
⛔  rev_grw_pop                                        SHAP=0.00000
⛔  eps_grw_pop                                        SHAP=0.00000
⛔  unemp_x_rev_grw                                    SHAP=0.00000
⛔  ocf_to_net_income                                  SHAP=0.00000
⛔  quick_ratio                                        SHAP=0.00000
⛔  current_ratio_sector_z                             SHAP=0.00000

The symbol target encoding? SHAP likes it. It moves the proverbial needle.

But when it comes to sector and industry, neither of the one-hot encodings seems to be adding any real benefit.
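
To be fair to the one-hot group, the individual dummies should be judged together, not one at a time. A sketch, assuming mean_abs_shap is a pandas Series of the mean-|SHAP| values keyed by feature name (i.e., the numbers in the table above):

# Sum the sector dummies to score the sector feature as a whole.
sector_cols = [f for f in mean_abs_shap.index if f.startswith('sector_')]
print(f"sector (all dummies combined): SHAP={mean_abs_shap[sector_cols].sum():.5f}")

From the table above, that sum comes to roughly 0.002 - still well below symbol_enc at 0.01055, so sector adds little even in aggregate.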

Our R-squared was better than the run with one-hot encoded industries - but lower than our previous best runs with NO encodings, as shown below.

Symbol Target-Encoded and Sector One-Hot Encoded:
🪓  Pruning all splits - train, val, test - to the pruned set of features (final_features).
Training pruned model...
FULL R² -- Train: 0.4708, Validation: 0.0757, Test: 0.0868
PRUNED R² -- Train: 0.3847, Validation: 0.0779, Test: 0.0486
Selected FULL model based on test R².
Previous best test R²: 0.1126
Current model test R²: 0.0868
Current model did NOT exceed best. Loading previous best model and features.
Final Model Test Metrics -- R²: 0.1126, RMSE: 0.2154, MAE: 0.1485
Feature importance summary:
  → Total features evaluated: 78

So - one last run, with only the Symbol Target Encoding (not Sector and Industry)
🪓  Pruning all splits - train, val, test - to the pruned set of features (final_features).
Training pruned model...
FULL R² -- Train: 0.5169, Validation: 0.0788, Test: 0.0463
PRUNED R² -- Train: 0.2876, Validation: 0.0456, Test: 0.0083
Selected FULL model based on test R².
Previous best test R²: 0.1126
Current model test R²: 0.0463
Current model did NOT exceed best. Loading previous best model and features.
Final Model Test Metrics -- R²: 0.1126, RMSE: 0.2154, MAE: 0.1479
Feature importance summary:
  → Total features evaluated: 78

And...back to no symbol, sector, industry
🪓  Pruning all splits - train, val, test - to the pruned set of features (final_features).
Training pruned model...
FULL R² -- Train: 0.4876, Validation: 0.1095, Test: 0.0892
PRUNED R² -- Train: 0.2870, Validation: 0.0713, Test: 0.0547
Selected FULL model based on test R².
Previous best test R²: 0.1126
Current model test R²: 0.0892
Current model did NOT exceed best. Loading previous best model and features.
Final Model Test Metrics -- R²: 0.1126, RMSE: 0.2188, MAE: 0.1486

Once again - no encoding at all on symbol, sector and industry...
🪓  Pruning all splits - train, val, test - to the pruned set of features (final_features).
Training pruned model...
FULL R² -- Train: 0.4985, Validation: 0.1051, Test: 0.0969
PRUNED R² -- Train: 0.2742, Validation: 0.0981, Test: 0.0869
Selected FULL model based on test R².
Previous best test R²: 0.1126
Current model test R²: 0.0969
Current model did NOT exceed best. Loading previous best model and features.
Final Model Test Metrics -- R²: 0.1126, RMSE: 0.2160, MAE: 0.1487
Feature importance summary:
  → Total features evaluated: 78

Conclusion: these encodings did not improve the R-squared at all - in fact, they moved it backwards. I still need to confirm there is no benefit to the encoding, and if there isn't, I will leave it out and run the model as it was originally.
