Wednesday, October 15, 2025

My New Stock Prediction Model - Short Term Stock Prediction

Someone I work with has been working extensively on a Swing Trading model.

He has great financial experience from what I understand, but since he is in more of a management role than a day-to-day technical one, his programming skills might be a step behind mine.

I have been watching him publish his short-term predictions, and his model is based on all kinds of things. I won't publish his secret sauce here, but he is using things that day traders tend to use, like RSI and Stochastics and such.

But, like my Financial Statement model - which tried to predict longer-term buy-and-hold stocks - his wasn't holding water either through backtesting and results. One problem is that everything looks great in a bull market (a rising tide lifts all boats). And, to quote Buffett, "only when the tide goes out do you discover who's been swimming naked." So these models need to hold up in BOTH upturns and downturns.

I have decided to work on a new stock picking model. I am not sure yet how much I will blog about the specifics of it. But it is also a short-term model.

Stay Tuned 

Tuesday, September 30, 2025

The Financial Statement Model - Retired for Now

Once I got my stock prediction model - based on Annual (10-K) and Quarterly (10-Q) statements - working, I just wasn't happy with its R-squared. And I didn't feel comfortable investing in the picks it made (based on predicted returns).

The R-squared on quarterly data was so low that considering its picks for even a quarter-long buy-and-hold was just not feasible.

The R-squared on annual was considerably higher. But even then, it was not high enough to justify a stock purchase for a year-long tie-up of investment money.

Frankly, the stocks it was picking looked horrendous in many respects. Falling Knives, despite efforts to contain those, dominated the list. Others had low liquidity (read my earlier post on the Liquidity Effect), and solvency was an issue with them. Buying stocks that have little or no liquidity and are practically insolvent, then trying to hold them for even a quarter, let alone a year, is absolutely stupid.

I did Ensemble these models. But it didn't change the picture for me. And remember, I have Macro data and Macro Interactives in this model!

The conclusion: 
Statements (fundamentals) are important - but not necessarily for picking stocks on their own. You would have to combine the fundamentals with other things.

I kind of knew this already, based on things I had read. I guess I needed to use the effort as a proving ground to myself.

So - in the end - I shelved these models. I learned a TON and it was great doing them. It built me into an AI Powerhouse with solid fundamentals in Quant Finance, a thorough understanding of Data Science and ML/AI algorithms, statistics, beefed-up math skills, etc.

I will move on. 

Monday, September 22, 2025

Target Encoding & One-Hot Encoding

I had completely overlooked these concepts earlier on; they didn't come up until just now - this late in the game - as I was reading some of the advanced sections of the Jansen Algorithmic Trading book.

When you run a model, there are always going to be things you do NOT want to include, for many reasons (circular correlation/leakage, redundancy, etc.). Generally, you would not want to include symbol, sector, and industry in your model, because the symbol, after all, is nothing more than an index in the data - and the symbol in and of itself should have NOTHING to do with predicting returns.

So we keep an Exclusion array of metrics we don't want included:
EXCLUDED_FEATURES = [
    # 'symbol', 
    'date', 'year', 'quarter', 'close', 'fwdreturn',  # identity + target
    # 'sector', 'industry',  optional: include via one-hot or embeddings if desired
...
]

Symbol, Sector, and Industry were initially excluded because they are text strings, and models don't like text data (some won't run unless everything is numeric, some give warnings, and others produce unpredictable results if they tolerate it at all).

Then I read about target encodings for symbols - which allow you to KEEP symbols in the data, and something called "one-hot" encodings for things like sectors and industries.

What these encodings do is take categorical data - like sectors and industries - and convert it to numerics so that the model will consider it. Target encoding in particular captures the association between category and target, sometimes boosting model power.

So - adding the code to do this wasn't terribly painful. The symbol was encoded as a single target-encoded feature (symbol_enc) and removed from the above-mentioned EXCLUDED_FEATURES list. The code to one-hot encode Sector and Industry was snapped in, and off we went...
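
For the curious, here is a hedged sketch of what both encodings can look like in pandas - the frame names (df, train_df, val_df, test_df) are illustrative, not my exact pipeline:

import pandas as pd

# One-hot: one 0/1 column per category (e.g. sector_Energy), applied to
# the full frame so every split ends up with the same columns.
df = pd.get_dummies(df, columns=["sector", "industry"])

# Target encoding for symbol: map each symbol to its mean forward return,
# computed on the TRAINING split only so the target never leaks into
# validation/test; unseen symbols fall back to the global mean.
sym_means = train_df.groupby("symbol")["fwdreturn"].mean()
global_mean = sym_means.mean()
for split in (train_df, val_df, test_df):
    split["symbol_enc"] = split["symbol"].map(sym_means).fillna(global_mean)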

The R-squared dropped from .11 to .06. Yikes...
🪓  Pruning all splits - train, val, test - to the pruned set of features (final_features).
Training pruned model...
FULL R² -- Train: 0.4989, Validation: 0.0776, Test: 0.0673
PRUNED R² -- Train: 0.2216, Validation: 0.0348, Test: 0.0496
Selected FULL model based on test R².
Previous best test R²: 0.1126
Current model test R²: 0.0673
Current model did NOT exceed best. Loading previous best model and features.
Final Model Test Metrics -- R²: 0.1126, RMSE: 0.2133, MAE: 0.1471


I quickly realized that by one-hot encoding the Industry, we got so many new feature columns - one per industry - that it undoubtedly screwed up the model.

I decided to run again and one-hot encode only the Sector. So with this run, we have a target encoding for symbol and a one-hot encoding for sector. There are not that many sectors, so this doesn't become too unwieldy.

But - here is what we got on our SHAP analysis after this change:
🧮 Feature Selection Overview:
✅  consumer_sentiment_×_earnings_yield                SHAP=0.04511
✅  dilutedEPS                                         SHAP=0.02566
✅  rev_grw_4q                                         SHAP=0.01191
✅  evToSales                                          SHAP=0.01181
✅  roe                                                SHAP=0.01071
✅  symbol_enc                                         SHAP=0.01055
✅  business_confidence_×_capexToRevenue               SHAP=0.00925
✅  treasury_spread_x_debt_to_equity                   SHAP=0.00867
✅  cpippi_marginsqueeze                               SHAP=0.00719
✅  earningsYield                                      SHAP=0.00703
✅  inflation_x_debt_to_equity                         SHAP=0.00701
⛔  ppi_x_capextorevenue                               SHAP=0.00628
⛔  operatingCashFlow_to_totalAssets                   SHAP=0.00610
⛔  momentum_1y                                        SHAP=0.00586
⛔  gross_margin                                       SHAP=0.00575
⛔  realized_vol_3m                                    SHAP=0.00543
⛔  long_term_debt_to_equity                           SHAP=0.00526
⛔  rev_grw_qoq                                        SHAP=0.00500
⛔  vix_×_debt_to_equity                               SHAP=0.00480
⛔  goodwill_to_totalAssets                            SHAP=0.00432
⛔  cpi_x_netprofitmargin                              SHAP=0.00371
⛔  incomeQuality                                      SHAP=0.00360
⛔  rev_grw_qoq_to_totalAssets                         SHAP=0.00341
⛔  netDebtToEBITDA                                    SHAP=0.00340
⛔  evToFreeCashFlow                                   SHAP=0.00304
⛔  debt_to_equity                                     SHAP=0.00303
⛔  salesGeneralAndAdministrativeToRevenue             SHAP=0.00288
⛔  vix_x_evtoebitda                                   SHAP=0.00273
⛔  rev_grw_pop_sector_z                               SHAP=0.00268
⛔  eps_grw_qoq                                        SHAP=0.00258
⛔  evToEBITDA                                         SHAP=0.00258
⛔  interestBurden                                     SHAP=0.00255
⛔  momentum_1y_sector_z                               SHAP=0.00252
⛔  evToEBITDA_sector_z                                SHAP=0.00237
⛔  asset_turnover                                     SHAP=0.00235
⛔  totalEquity                                        SHAP=0.00231
⛔  net_margin                                         SHAP=0.00228
⛔  workingCapital                                     SHAP=0.00218
⛔  cc_delinquency_rate_×_debt_ratio                   SHAP=0.00209
⛔  inventory_turnover                                 SHAP=0.00209
⛔  capexToRevenue                                     SHAP=0.00197
⛔  freeCashFlowPerShare                               SHAP=0.00191
⛔  daysOfInventoryOutstanding_sector_z                SHAP=0.00186
⛔  fcf_ps_grw_qoq                                     SHAP=0.00186
⛔  eps_grw_qoq_to_totalAssets                         SHAP=0.00182
⛔  daysOfInventoryOutstanding                         SHAP=0.00180
⛔  daysOfPayablesOutstanding                          SHAP=0.00179
⛔  capexToRevenue_sector_z                            SHAP=0.00179
⛔  operatingCashFlow                                  SHAP=0.00173
⛔  totalEquity_to_totalAssets                         SHAP=0.00163
⛔  stockBasedCompensationToRevenue                    SHAP=0.00159
⛔  cash_ratio                                         SHAP=0.00158
⛔  evToOperatingCashFlow                              SHAP=0.00156
⛔  debt_ratio                                         SHAP=0.00156
⛔  eps_grw_4q                                         SHAP=0.00155
⛔  receivables_turnover                               SHAP=0.00153
⛔  ocf_to_current_liabilities                         SHAP=0.00151
⛔  totalDebt                                          SHAP=0.00146
⛔  operating_margin                                   SHAP=0.00143
⛔  debt_ratio_sector_z                                SHAP=0.00143
⛔  workingCapital_to_totalAssets                      SHAP=0.00135
⛔  operatingReturnOnAssets                            SHAP=0.00130
⛔  daysOfSalesOutstanding                             SHAP=0.00129
⛔  ordinary_shares                                    SHAP=0.00128
⛔  roa                                                SHAP=0.00127
⛔  earnings_surprise                                  SHAP=0.00122
⛔  bookValue                                          SHAP=0.00118
⛔  totalLiabilities                                   SHAP=0.00114
⛔  fcf_ps_grw_qoq_to_totalAssets                      SHAP=0.00113
⛔  goodwill                                           SHAP=0.00112
⛔  fcf_ps_grw_4q                                      SHAP=0.00106
⛔  eps_grw_pop_sector_z                               SHAP=0.00100
⛔  freeCashFlow                                       SHAP=0.00090
⛔  current_ratio                                      SHAP=0.00090
⛔  ebitda_margin                                      SHAP=0.00087
⛔  totalRevenue                                       SHAP=0.00082
⛔  quick_ratio_sector_z                               SHAP=0.00078
⛔  free_cash_flow                                     SHAP=0.00073
⛔  ocf_to_total_liabilities                           SHAP=0.00062
⛔  sector_Energy                                      SHAP=0.00062
⛔  returnOnTangibleAssets                             SHAP=0.00059
⛔  avgDilutedShares                                   SHAP=0.00050
⛔  sector_Basic Materials                             SHAP=0.00034
⛔  sector_Technology                                  SHAP=0.00025
⛔  sector_Real Estate                                 SHAP=0.00025
⛔  sector_Financial Services                          SHAP=0.00019
⛔  sector_Industrials                                 SHAP=0.00011
⛔  sector_Consumer Cyclical                           SHAP=0.00010
⛔  sector_Communication Services                      SHAP=0.00009
⛔  sector_Healthcare                                  SHAP=0.00007
⛔  sector_Utilities                                   SHAP=0.00004
⛔  sector_Consumer Defensive                          SHAP=0.00002
⛔  fcf_ps_grw_pop                                     SHAP=0.00000
⛔  rev_grw_pop                                        SHAP=0.00000
⛔  eps_grw_pop                                        SHAP=0.00000
⛔  unemp_x_rev_grw                                    SHAP=0.00000
⛔  ocf_to_net_income                                  SHAP=0.00000
⛔  quick_ratio                                        SHAP=0.00000
⛔  current_ratio_sector_z                             SHAP=0.00000
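
For context, the ✅/⛔ split above is a cut on mean absolute SHAP value. A minimal sketch of that idea (model, X_val, and the 0.0065 threshold are all illustrative - this is not my exact selection code):

import numpy as np
import shap

# Rank features by mean |SHAP| and keep those above a cutoff.
explainer = shap.TreeExplainer(model)          # model: the trained booster
shap_values = explainer.shap_values(X_val)     # X_val: validation features
mean_abs_shap = np.abs(shap_values).mean(axis=0)
final_features = [f for f, s in zip(X_val.columns, mean_abs_shap) if s >= 0.0065]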

The symbol target encoding? SHAP likes it. It moves the proverbial needle.

But when it comes to sector and industry, the one-hot encodings don't seem to be adding any real benefit.

Our R-squared was better than the run with one-hot encoded industries, but lower than our previous and best runs with NO encodings, as shown below.

Symbol Target-Encoded and Sector One-Hot Encoded:
🪓  Pruning all splits - train, val, test - to the pruned set of features (final_features).
Training pruned model...
FULL R² -- Train: 0.4708, Validation: 0.0757, Test: 0.0868
PRUNED R² -- Train: 0.3847, Validation: 0.0779, Test: 0.0486
Selected FULL model based on test R².
Previous best test R²: 0.1126
Current model test R²: 0.0868
Current model did NOT exceed best. Loading previous best model and features.
Final Model Test Metrics -- R²: 0.1126, RMSE: 0.2154, MAE: 0.1485
Feature importance summary:
  → Total features evaluated: 78

So - one last run, with only the Symbol Target Encoding (no Sector or Industry encodings)
🪓  Pruning all splits - train, val, test - to the pruned set of features (final_features).
Training pruned model...
FULL R² -- Train: 0.5169, Validation: 0.0788, Test: 0.0463
PRUNED R² -- Train: 0.2876, Validation: 0.0456, Test: 0.0083
Selected FULL model based on test R².
Previous best test R²: 0.1126
Current model test R²: 0.0463
Current model did NOT exceed best. Loading previous best model and features.
Final Model Test Metrics -- R²: 0.1126, RMSE: 0.2154, MAE: 0.1479
Feature importance summary:
  → Total features evaluated: 78

And...back to no symbol, sector, industry
🪓  Pruning all splits - train, val, test - to the pruned set of features (final_features).
Training pruned model...
FULL R² -- Train: 0.4876, Validation: 0.1095, Test: 0.0892
PRUNED R² -- Train: 0.2870, Validation: 0.0713, Test: 0.0547
Selected FULL model based on test R².
Previous best test R²: 0.1126
Current model test R²: 0.0892
Current model did NOT exceed best. Loading previous best model and features.
Final Model Test Metrics -- R²: 0.1126, RMSE: 0.2188, MAE: 0.1486

Once again - no encoding at all on symbol, sector and industry...
🪓  Pruning all splits - train, val, test - to the pruned set of features (final_features).
Training pruned model...
FULL R² -- Train: 0.4985, Validation: 0.1051, Test: 0.0969
PRUNED R² -- Train: 0.2742, Validation: 0.0981, Test: 0.0869
Selected FULL model based on test R².
Previous best test R²: 0.1126
Current model test R²: 0.0969
Current model did NOT exceed best. Loading previous best model and features.
Final Model Test Metrics -- R²: 0.1126, RMSE: 0.2160, MAE: 0.1487
Feature importance summary:
  → Total features evaluated: 78

Conclusion: these encodings did not improve the R-squared at all - in fact, they moved it backwards. Unless I can verify some benefit to the encoding, I will leave those features out, as the model was running originally.

I Have More Data Now - Enough for an AI RNN LSTM Model?

I have a LOT more data now than I did before. And an advanced architecture to process it.

Should I consider an RNN?

I knew I couldn't really pull it off with the Annual data I had, because by the time you split the data for training, validation, and testing, there isn't enough left to feed the algorithm.

But - Quarterly! I have a LOT of quarterly data now, many statements per symbol across quarter-dates. ~70K rows of data!!!

So let's try an LSTM. I wrote a standalone LSTM using Keras - just a few lines of code.

One important note about this! 

Do NOT mix your data processing and/or XGBoost code with neural network code! ALWAYS create a brand-new virtual environment for your neural RNN code, because Keras and the competing frameworks require specific versions of Python libraries that may conflict with your data-processing and/or XGBoost libraries!

Now, with that important disclaimer, here is a small sample of the code. I will highlight it in blue, since my blog tool apparently has no code-block format.

# -------------------------
# Imports (Keras via TensorFlow, plus scikit-learn utilities)
# -------------------------
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense
from tensorflow.keras.callbacks import EarlyStopping

# Assumed to be defined earlier: X shaped (samples, SEQ_LEN, n_features),
# y (forward returns), plus feature_cols, SEQ_LEN, TEST_SIZE, RANDOM_STATE.

# -------------------------
# Train/test split
# -------------------------
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=TEST_SIZE, random_state=RANDOM_STATE
)

# -------------------------
# Build LSTM model
# -------------------------
model = Sequential()
model.add(LSTM(32, input_shape=(SEQ_LEN, len(feature_cols)), return_sequences=False))
model.add(Dropout(0.2))
model.add(Dense(16, activation='relu'))
model.add(Dense(1))  # regression output

model.compile(optimizer='adam', loss='mse')

# Early stopping
es = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

# -------------------------
# Train
# -------------------------
history = model.fit(
    X_train, y_train,
    validation_split=0.1,
    epochs=50,
    batch_size=16,
    callbacks=[es],
    verbose=1
)

# -------------------------
# Evaluate
# -------------------------
y_pred = model.predict(X_test).flatten()
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
print(f"R²: {r2:.3f}, MAE: {mae:.3f}")

Well, how did it go?


The previous XGBoost R-squared value was consistently .11-.12. Now we are getting .17-.19. This is a noticeable, significant improvement!
 

Changing the Ensemble Model to a Stacked Meta Ensemble

 
Earlier, we had a weighted ensemble model that essentially used the R-squared values of the Annual and Quarterly models as the weighting factors to combine them.

It was here that I realized we were not calculating or saving the predicted forward return - we were only calculating scores, writing them to a scoring summary, and saving the R-squared.

So I changed things around. I added a stacked meta ensemble, and I will describe how both work below. We now run BOTH of these.

Weighted Ensemble

  • A simple blend of the two base models.
  •  Annual and quarterly predictions are combined with weights proportional to their out-of-sample R² performance.

Result: ensemble_pred_fwdreturn and ensemble_pred_fwdreturn_pct.

This improves stability but is still fairly “rigid.”
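
For the curious, a minimal sketch of the weighted blend - the column names and helper here are illustrative, not my production code:

import pandas as pd

# Blend the two base predictions with weights proportional to each
# model's out-of-sample R² (floored at zero, then normalized).
def weighted_ensemble(df: pd.DataFrame, r2_annual: float, r2_quarterly: float) -> pd.DataFrame:
    w_a, w_q = max(r2_annual, 0.0), max(r2_quarterly, 0.0)
    out = df.copy()
    out["ensemble_pred_fwdreturn"] = (
        w_a * out["pred_fwdreturn_annual"] + w_q * out["pred_fwdreturn_quarterly"]
    ) / (w_a + w_q)
    out["ensemble_pred_fwdreturn_pct"] = out["ensemble_pred_fwdreturn"] * 100.0
    return out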


Meta-Model Ensemble (Stacked Ensemble)

A second-level model (XGBoost) is trained on:

  1. Predictions from the annual model
  2. Predictions from the quarterly model
  3. Additional features (sector, industry, etc.)

This meta-model learns the optimal way to combine signals dynamically rather than relying on fixed weights.

Result: ensemble_pred_fwdreturn_meta and ensemble_pred_fwdreturn_meta_pct.
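
Again a hedged sketch, assuming the base predictions and extra features are already numeric columns (the names are illustrative):

import pandas as pd
from xgboost import XGBRegressor

# Second-level model: learns how to combine the two base predictions
# plus extra (already-encoded) features, instead of using fixed weights.
def fit_meta_model(train: pd.DataFrame, extra_cols: list) -> XGBRegressor:
    X_meta = train[["pred_fwdreturn_annual", "pred_fwdreturn_quarterly"] + extra_cols]
    y_meta = train["fwdreturn"]
    meta = XGBRegressor(n_estimators=300, max_depth=3, learning_rate=0.05)
    meta.fit(X_meta, y_meta)
    return meta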

How well did it work? Results:

  1. Weighted Ensemble: R² ~0.19, Spearman ~0.50
  2. Meta-Model Ensemble: R² ~0.75, Spearman ~0.65

Quintile backtests confirm a strong monotonic relationship between predicted quintiles and realized forward returns.
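
A quintile backtest itself is only a few lines - a hedged sketch, with illustrative frame names:

import pandas as pd

# Bucket the meta predictions into quintiles, then check that realized
# forward returns rise monotonically from Q1 to Q5.
test_df["quintile"] = pd.qcut(
    test_df["ensemble_pred_fwdreturn_meta"], 5, labels=["Q1", "Q2", "Q3", "Q4", "Q5"]
)
print(test_df.groupby("quintile")["fwdreturn"].mean())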

Friday, September 5, 2025

I Need More Financial Quant Data - Techniques On How To Get It

I may have posted earlier about how finding enough data - for free - is extreeeemely difficult.

Even if you can find it, ensuring the integrity of it can cost time, money, and cycles - so much so that it's simpler to just let someone else deal with it and subscribe. Problem is, I am "subscriptioned out". I can't keep adding layers upon layers of subscriptions, because that money adds up.

So - I work hard to see what data is available out there (e.g. Kaggle). It makes no sense to waste processing cycles and bandwidth if someone has already cultivated that data and is willing to share it.

I also have learned that there are a lot of bots out there that screen-scrape, using tools like Beautiful Soup. And if you are clever enough to use layers of (secure - I can't stress that enough) proxies, and to morph your digital fingerprint (e.g. changing up browser headers and such), you can go out there and find data, save it - and even check the integrity of the data against two or three sources.

And don't forget rate-limiting and Cloudflare tools - you have to figure out how to evade those as well. It's a chess game, and one that seemingly never ends.

Anyway - I decided I needed quarterly data. My XGBoost model just wasn't computing the way I wanted. I added more interactive features from macro data, and even a "graph score" (see earlier posts). And indeed, the score - the R-squared score - came up. But it didn't get to where I wanted it, and the list of stock picks was not one I would personally invest in.

I decided to do two things:

  1. Find superior data source(s) - preferably where I could get more and better quarterly data - for free.
  2. Consolidate the code so that I didn't have to manage and sync code that was fetching on one frequency (annual) vs another.

I underestimated these tasks. Greatly.

I found a Github project that could hit different data sources. It had an OO design - and was probably over-engineered IMHO. But, I got what the author was after - by using a base class and then plugging in different "interface classes", you could maybe switch back and forth between different data sources. 

So I tried it. And, lo and behold, it didn't work. At first it did - for annual statements. But after I downloaded about 8k quarterly statements, I was horrified to realize that all of the quarterly statements were clones of the annual statements. Wow, what a waste!!!

I checked - and the quarterly data was indeed there. The Github code was flawed. So I fixed it - and even enhanced it.

This is the first time I have actually contributed to a community Github project. I am familiar with Git and Github, but if you are not doing this kind of thing regularly, you have to re-learn topics such as branch development, pull requests, merges, etc. And perhaps one of the most annoying things is that the upstream owner of the repository may not like or agree with your changes.

In this particular case, the repo owner was using property decorators. Properties work fine for parameterless accessors, but property access takes no arguments - so the moment a call needs parameters, it can't be a property. I had to blow those out. He didn't seem happy about it, but eventually he seemed to acknowledge the need. Another difference of opinion had to do with his use of the lru_cache decorator on calls. I wasn't up to speed on this, so I read up on it and concluded that this was NOT the right situation to use caching, let alone LRU caching. Caching can speed things up TREMENDOUSLY in the right use cases - but if you are batch downloading thousands of statements for thousands of symbols, you will rarely request the same statement twice, so a cache like that actually creates overhead - and risk (i.e. running out of resources like memory if you don't set a max size on the cache).
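
To make the property issue concrete, here is an illustrative toy - NOT the stockdex code:

from functools import lru_cache

class Statements:
    @property
    def annual(self):            # fine as a property: obj.annual
        return self._fetch("annual")

    def quarterly(self, year):   # cannot be a property: obj.quarterly(2024)
        return self._fetch("quarterly", year)

    # If caching is used at all in a batch downloader, bound it - an
    # unbounded lru_cache grows with every symbol/statement fetched.
    @lru_cache(maxsize=128)
    def _fetch(self, frequency, year=None):
        ...                      # network call would go here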

In the end, I have some code that works. I had to do a rebase and update the pull request, and if he doesn't take these changes the way I wrote them and need them, I guess I can always create my own repo and go solo on this. I would rather not, because the repo owner syncs his repository with the pip installer, which makes it easy to download and update.

  

Thursday, August 28, 2025

Latest Changes to XGBoost Quant Financial Model

 

Latest Changes:

Added new Macros to my model - CPI, PPI, VIX. (The last change I made, a few weeks ago, was to add a beats/meets/misses surprise score.)

I added some interactive features based on these (5 in total). I have learned that these interactives move the model's predictability like nothing else - which is why I added more. (A sketch of what an interactive looks like follows the notes below.)

    R-squared score came back up to approaching .3 now with these.

    My correlation is still inverted from 1 yr fwd return - even more so. So I flip the score.
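
Concretely, an interactive is just the product of a macro series and a firm-level metric. A hedged sketch - the output names mirror my SHAP lists, but the raw macro columns (cpi, inflation, vix) are illustrative:

import pandas as pd

def add_macro_interactions(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out["cpi_x_netprofitmargin"] = out["cpi"] * out["net_margin"]
    out["inflation_x_debt_to_equity"] = out["inflation"] * out["debt_to_equity"]
    out["vix_x_evtoebitda"] = out["vix"] * out["evToEBITDA"]
    return out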

One major change was that I added a "graph_score", which was a nightmare to produce. LLMs can NOT seem to handle this task AT ALL. So I finally had to do it mostly myself, and I got something working fairly well - it recognizes good graphs vs. bad graphs and down-scores bad graph patterns.

I forked the stockdex Github project and am making some changes that I will submit back via a pull request. The macrotrends data source could not handle quarterly data. Once I realized it could be done, I decided to enhance the code so I could run the model and get more quarterly statements alongside the pack of annual statements.

Once I get the model running with annual+quarterly data, I will probably retire this project and move on to more recent, LLM-based stuff.
