Monday, September 22, 2025

Target Encoding & One-Hot Encoding

I had completely overlooked these concepts early on; they didn't come up until just now - this late in the game - while I was reading some of the advanced sections of the Jansen Algorithmic Trading book.

When you run a model, there are always going to be things you do NOT want to include, for many reasons (circular correlation, redundancy, etc.). Generally, you would not want to include symbol, sector, and industry in your model, because the symbol is, after all, nothing more than an index into the data - and the symbol in and of itself should have NOTHING to do with predicting returns.

So we keep an exclusion array of features we don't want included:
EXCLUDED_FEATURES = [
    # 'symbol', 
    'date', 'year', 'quarter', 'close', 'fwdreturn',  # identity + target
    # 'sector', 'industry',  # optional: include via one-hot or embeddings if desired
...
]
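
Applying the list is then just a filter. A minimal sketch, assuming the data lives in a pandas DataFrame named df (a hypothetical name - the actual pipeline code differs):

# Keep every column that isn't on the exclusion list (sketch; df is assumed).
feature_cols = [c for c in df.columns if c not in EXCLUDED_FEATURES]
X = df[feature_cols]
y = df['fwdreturn']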

Symbol, Sector, and Industry were initially excluded because they are text strings, and models don't like text data (some won't run on anything but numerics, some emit warnings, and others produce unpredictable results if they tolerate it at all).

Then I read about target encodings for symbols - which allow you to KEEP symbols in the data - and about something called "one-hot" encodings for things like sectors and industries.

What these encodings do is take categorical data - like sectors and industries - and convert it to numerics so that it will be considered by the model. Target encoding, for example, replaces each symbol with the average forward return observed for that symbol in the training data, which captures the association between category and target and can sometimes boost model power. One-hot encoding instead creates a separate 0/1 column for each category value.

So - adding the code to do this wasn't terribly painful. The symbol was target-encoded as a single feature (symbol_enc) and removed from the above-mentioned EXCLUDED_FEATURES list. The code to one-hot encode the Sector and Industry was snapped in, and off we went...
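
For anyone following along, here is roughly what that step looks like. This is a hedged sketch, not the actual pipeline code - train, val, and test are assumed pandas DataFrames that still carry the raw text columns:

import pandas as pd

# Target encoding: replace each symbol with the mean forward return it
# showed in the TRAINING split only, so val/test targets never leak in.
symbol_means = train.groupby('symbol')['fwdreturn'].mean()
global_mean = train['fwdreturn'].mean()   # fallback for unseen symbols
for split in (train, val, test):
    split['symbol_enc'] = split['symbol'].map(symbol_means).fillna(global_mean)

# One-hot encoding: one 0/1 dummy column per category value
# (e.g. sector_Energy, sector_Technology, ...).
train = pd.get_dummies(train, columns=['sector', 'industry'])
val = pd.get_dummies(val, columns=['sector', 'industry'])
test = pd.get_dummies(test, columns=['sector', 'industry'])
# Align the splits so they all share the same dummy columns.
val = val.reindex(columns=train.columns, fill_value=0)
test = test.reindex(columns=train.columns, fill_value=0)

The train-only means matter: computing the target encoding over the full dataset would leak the validation and test targets into the features.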

The R-squared dropped from .11 to .06. Yikes...
🪓  Pruning all splits - train, val, test - to the pruned set of features (final_features).
Training pruned model...
FULL R² -- Train: 0.4989, Validation: 0.0776, Test: 0.0673
PRUNED R² -- Train: 0.2216, Validation: 0.0348, Test: 0.0496
Selected FULL model based on test R².
Previous best test R²: 0.1126
Current model test R²: 0.0673
Current model did NOT exceed best. Loading previous best model and features.
Final Model Test Metrics -- R²: 0.1126, RMSE: 0.2133, MAE: 0.1471


I quickly realized that by one-hot encoding the Industry, we created so many new dummy columns - one per industry - that the exploded feature space undoubtedly screwed up the model.

I decided to run again and one-hot encode only the Sector. So now with this run, we have a target encoding for symbol and a one-hot encoding for sector. There are not that many sectors, so this doesn't become too unwieldy - a quick cardinality check (below) bears this out.
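
Since one-hot adds one column per distinct value, checking cardinality up front tells you how bad the blow-up will be (a sketch, again assuming a DataFrame df with the raw text columns):

print(df['sector'].nunique())    # ~11 sectors -> 11 dummy columns, manageable
print(df['industry'].nunique())  # typically 100+ industries -> feature explosion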

But - here is what we got on our SHAP analysis after this change:
🧮 Feature Selection Overview:
✅  consumer_sentiment_×_earnings_yield                SHAP=0.04511
✅  dilutedEPS                                         SHAP=0.02566
✅  rev_grw_4q                                         SHAP=0.01191
✅  evToSales                                          SHAP=0.01181
✅  roe                                                SHAP=0.01071
✅  symbol_enc                                         SHAP=0.01055
✅  business_confidence_×_capexToRevenue               SHAP=0.00925
✅  treasury_spread_x_debt_to_equity                   SHAP=0.00867
✅  cpippi_marginsqueeze                               SHAP=0.00719
✅  earningsYield                                      SHAP=0.00703
✅  inflation_x_debt_to_equity                         SHAP=0.00701
⛔  ppi_x_capextorevenue                               SHAP=0.00628
⛔  operatingCashFlow_to_totalAssets                   SHAP=0.00610
⛔  momentum_1y                                        SHAP=0.00586
⛔  gross_margin                                       SHAP=0.00575
⛔  realized_vol_3m                                    SHAP=0.00543
⛔  long_term_debt_to_equity                           SHAP=0.00526
⛔  rev_grw_qoq                                        SHAP=0.00500
⛔  vix_×_debt_to_equity                               SHAP=0.00480
⛔  goodwill_to_totalAssets                            SHAP=0.00432
⛔  cpi_x_netprofitmargin                              SHAP=0.00371
⛔  incomeQuality                                      SHAP=0.00360
⛔  rev_grw_qoq_to_totalAssets                         SHAP=0.00341
⛔  netDebtToEBITDA                                    SHAP=0.00340
⛔  evToFreeCashFlow                                   SHAP=0.00304
⛔  debt_to_equity                                     SHAP=0.00303
⛔  salesGeneralAndAdministrativeToRevenue             SHAP=0.00288
⛔  vix_x_evtoebitda                                   SHAP=0.00273
⛔  rev_grw_pop_sector_z                               SHAP=0.00268
⛔  eps_grw_qoq                                        SHAP=0.00258
⛔  evToEBITDA                                         SHAP=0.00258
⛔  interestBurden                                     SHAP=0.00255
⛔  momentum_1y_sector_z                               SHAP=0.00252
⛔  evToEBITDA_sector_z                                SHAP=0.00237
⛔  asset_turnover                                     SHAP=0.00235
⛔  totalEquity                                        SHAP=0.00231
⛔  net_margin                                         SHAP=0.00228
⛔  workingCapital                                     SHAP=0.00218
⛔  cc_delinquency_rate_×_debt_ratio                   SHAP=0.00209
⛔  inventory_turnover                                 SHAP=0.00209
⛔  capexToRevenue                                     SHAP=0.00197
⛔  freeCashFlowPerShare                               SHAP=0.00191
⛔  daysOfInventoryOutstanding_sector_z                SHAP=0.00186
⛔  fcf_ps_grw_qoq                                     SHAP=0.00186
⛔  eps_grw_qoq_to_totalAssets                         SHAP=0.00182
⛔  daysOfInventoryOutstanding                         SHAP=0.00180
⛔  daysOfPayablesOutstanding                          SHAP=0.00179
⛔  capexToRevenue_sector_z                            SHAP=0.00179
⛔  operatingCashFlow                                  SHAP=0.00173
⛔  totalEquity_to_totalAssets                         SHAP=0.00163
⛔  stockBasedCompensationToRevenue                    SHAP=0.00159
⛔  cash_ratio                                         SHAP=0.00158
⛔  evToOperatingCashFlow                              SHAP=0.00156
⛔  debt_ratio                                         SHAP=0.00156
⛔  eps_grw_4q                                         SHAP=0.00155
⛔  receivables_turnover                               SHAP=0.00153
⛔  ocf_to_current_liabilities                         SHAP=0.00151
⛔  totalDebt                                          SHAP=0.00146
⛔  operating_margin                                   SHAP=0.00143
⛔  debt_ratio_sector_z                                SHAP=0.00143
⛔  workingCapital_to_totalAssets                      SHAP=0.00135
⛔  operatingReturnOnAssets                            SHAP=0.00130
⛔  daysOfSalesOutstanding                             SHAP=0.00129
⛔  ordinary_shares                                    SHAP=0.00128
⛔  roa                                                SHAP=0.00127
⛔  earnings_surprise                                  SHAP=0.00122
⛔  bookValue                                          SHAP=0.00118
⛔  totalLiabilities                                   SHAP=0.00114
⛔  fcf_ps_grw_qoq_to_totalAssets                      SHAP=0.00113
⛔  goodwill                                           SHAP=0.00112
⛔  fcf_ps_grw_4q                                      SHAP=0.00106
⛔  eps_grw_pop_sector_z                               SHAP=0.00100
⛔  freeCashFlow                                       SHAP=0.00090
⛔  current_ratio                                      SHAP=0.00090
⛔  ebitda_margin                                      SHAP=0.00087
⛔  totalRevenue                                       SHAP=0.00082
⛔  quick_ratio_sector_z                               SHAP=0.00078
⛔  free_cash_flow                                     SHAP=0.00073
⛔  ocf_to_total_liabilities                           SHAP=0.00062
⛔  sector_Energy                                      SHAP=0.00062
⛔  returnOnTangibleAssets                             SHAP=0.00059
⛔  avgDilutedShares                                   SHAP=0.00050
⛔  sector_Basic Materials                             SHAP=0.00034
⛔  sector_Technology                                  SHAP=0.00025
⛔  sector_Real Estate                                 SHAP=0.00025
⛔  sector_Financial Services                          SHAP=0.00019
⛔  sector_Industrials                                 SHAP=0.00011
⛔  sector_Consumer Cyclical                           SHAP=0.00010
⛔  sector_Communication Services                      SHAP=0.00009
⛔  sector_Healthcare                                  SHAP=0.00007
⛔  sector_Utilities                                   SHAP=0.00004
⛔  sector_Consumer Defensive                          SHAP=0.00002
⛔  fcf_ps_grw_pop                                     SHAP=0.00000
⛔  rev_grw_pop                                        SHAP=0.00000
⛔  eps_grw_pop                                        SHAP=0.00000
⛔  unemp_x_rev_grw                                    SHAP=0.00000
⛔  ocf_to_net_income                                  SHAP=0.00000
⛔  quick_ratio                                        SHAP=0.00000
⛔  current_ratio_sector_z                             SHAP=0.00000

The symbol target encoding? SHAP likes it. It moves the proverbial needle.

But when it comes to sector and industry, neither of the one-hot encodings seems to be adding any real benefit.
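
To be fair to the one-hot group, the individual dummies should be judged together, not one at a time. A sketch, assuming mean_abs_shap is a pandas Series of the mean-|SHAP| values keyed by feature name (i.e., the numbers in the table above):

# Sum the sector dummies to score the sector feature as a whole.
sector_cols = [f for f in mean_abs_shap.index if f.startswith('sector_')]
print(f"sector (all dummies combined): SHAP={mean_abs_shap[sector_cols].sum():.5f}")

From the table above, that sum comes to roughly 0.002 - still well below symbol_enc at 0.01055, so sector adds little even in aggregate.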

Our R-squared was better than the run with one-hot encoded industries - but lower than our previous best runs with NO encodings, as shown below.

Symbol Target-Encoded and Sector One-Hot Encoded:
🪓  Pruning all splits - train, val, test - to the pruned set of features (final_features).
Training pruned model...
FULL R² -- Train: 0.4708, Validation: 0.0757, Test: 0.0868
PRUNED R² -- Train: 0.3847, Validation: 0.0779, Test: 0.0486
Selected FULL model based on test R².
Previous best test R²: 0.1126
Current model test R²: 0.0868
Current model did NOT exceed best. Loading previous best model and features.
Final Model Test Metrics -- R²: 0.1126, RMSE: 0.2154, MAE: 0.1485
Feature importance summary:
  → Total features evaluated: 78

So - one last run, with only the Symbol Target Encoding (not Sector and Industry)
🪓  Pruning all splits - train, val, test - to the pruned set of features (final_features).
Training pruned model...
FULL R² -- Train: 0.5169, Validation: 0.0788, Test: 0.0463
PRUNED R² -- Train: 0.2876, Validation: 0.0456, Test: 0.0083
Selected FULL model based on test R².
Previous best test R²: 0.1126
Current model test R²: 0.0463
Current model did NOT exceed best. Loading previous best model and features.
Final Model Test Metrics -- R²: 0.1126, RMSE: 0.2154, MAE: 0.1479
Feature importance summary:
  → Total features evaluated: 78

And...back to no symbol, sector, industry
🪓  Pruning all splits - train, val, test - to the pruned set of features (final_features).
Training pruned model...
FULL R² -- Train: 0.4876, Validation: 0.1095, Test: 0.0892
PRUNED R² -- Train: 0.2870, Validation: 0.0713, Test: 0.0547
Selected FULL model based on test R².
Previous best test R²: 0.1126
Current model test R²: 0.0892
Current model did NOT exceed best. Loading previous best model and features.
Final Model Test Metrics -- R²: 0.1126, RMSE: 0.2188, MAE: 0.1486

Once again - no encoding at all on symbol, sector and industry...
🪓  Pruning all splits - train, val, test - to the pruned set of features (final_features).
Training pruned model...
FULL R² -- Train: 0.4985, Validation: 0.1051, Test: 0.0969
PRUNED R² -- Train: 0.2742, Validation: 0.0981, Test: 0.0869
Selected FULL model based on test R².
Previous best test R²: 0.1126
Current model test R²: 0.0969
Current model did NOT exceed best. Loading previous best model and features.
Final Model Test Metrics -- R²: 0.1126, RMSE: 0.2160, MAE: 0.1487
Feature importance summary:
  → Total features evaluated: 78

Conclusion: these encodings did not improve the R-squared at all - in fact, they moved it backwards. I still need to confirm there is no benefit to the encoding, and if there isn't, I will leave it out and run the model as it was originally.
