Thursday, March 5, 2026

Backtesting - Decile Testing and Monotonicity - Part II

So now that we understand decile testing and monotonicity, we can run this on ALL features to see how they look.... 


And THIS is why the back-test was using "just" lag_ret_20d instead of the model predictions.

So the Macro features - just adding unnecessary noise, and no value!?

Is the News Sentiment adding any value at all? Maybe, they're ranked #4 and #8, but we would need to try to combine them to see if they add any value at all. Keep in mind also, that the residual (finbert_signed_resid) ferrets out "true" sentiment from momentum (as discussed in an earlier blog post).

And - if you combine them, in what ratio for them to make an optimal combination? Should we combine just two? Or more?

You can see where this can go. You almost need a permutational approach.  

Backtesting - Decile Testing and Monotonicity

 

The Main Backtest (what runs by default)

Uses ONLY lag_ret_20d and momentum_strength.

That's it. The cohort analysis, decile analysis, 2D grid, and portfolio simulations all just look at raw price features directly from the database. No sentiment, no macro, no model. It's purely:

  • Filter: is momentum_strength < 0 and lag_ret_20d > 5%?
  • Rank: sort by lag_ret_20d descending
  • Pick top N

The sentiment and macro features aren't even loaded in the default run. Look at the SQL query in load_from_db — it only pulls lag_ret_5d, lag_ret_20d, volatility_5d, and the return columns.
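The three steps above can be sketched in a few lines of pandas. This is a hypothetical reconstruction, not the actual backtest code; column names match the features discussed, and returns are assumed fractional (0.05 means 5%):

```python
import pandas as pd

def pick_candidates(df: pd.DataFrame, top_n: int = 10) -> pd.DataFrame:
    """Sketch of the default backtest selection: filter, rank, pick top N."""
    # Filter: decelerating momentum but a strong 20-day trend
    mask = (df["momentum_strength"] < 0) & (df["lag_ret_20d"] > 0.05)
    # Rank: sort survivors by lag_ret_20d, strongest first
    ranked = df[mask].sort_values("lag_ret_20d", ascending=False)
    # Pick: take the top N
    return ranked.head(top_n)
```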

How lag_ret_20d Became the Ranking Signal

It wasn't chosen upfront. It emerged from the decile analysis in the backtest output. When all stocks are sorted by lag_ret_20d from lowest to highest and you look at actual forward returns across deciles, you get this clean pattern:

D1  (avg lag20d: -26%) → ret_5d: -0.544%
D2  (avg lag20d: -13%) → ret_5d: -1.201%
...
D9  (avg lag20d: +15%) → ret_5d: +0.915%
D10 (avg lag20d: +29%) → ret_5d: +1.358%
 
 

7/9 monotonic — returns increase almost perfectly as lag_ret_20d increases. That's the definition of a good ranking signal. The data told you: stocks that have been strong over 20 days continue to be strong over the next 5 days.

When you ran the same decile test on momentum_strength you only got 4/9 monotonic — weak. And when you ran it on avg_pred (the model) you got 5/9 — moderate.

So the answer to "why lag_ret_20d" is simply: it produced the most monotonic decile separation on the holdout test set, which means it's the most reliable ranker of the three things tested.


What Hasn't Been Tested Yet

Here's the important caveat: you've only tested three ranking signals so far — momentum_strength, lag_ret_20d, and avg_pred. There could be better signals you haven't looked at yet, like volume, or lag_ret_20d combined with something else. But of the options tested, lag_ret_20d won clearly.

ok so the decile test is being run on all features, and lag_ret_20d is the one that had the best score? i don't know what monotonic means


Not quite — the decile test was only run on a few features, not all of them. And let me explain monotonic in plain terms first because it's the key concept.


Monotonic in Plain English

Imagine you rank all students in a class by how much they studied, from least to most. Then you look at their test scores. If more studying consistently means a better score, that's monotonic. Every step up in studying corresponds to a step up in grade.

In the backtest, you're asking: if I sort stocks by signal X from lowest to highest, do returns also go from lowest to highest? If yes, the signal is monotonic and you can trust it to rank stocks reliably.

Here's what lag_ret_20d looked like:

D1  lowest lag_ret_20d  → ret_5d: -1.201%   ← worst returns
D2                      → ret_5d: -0.544%
D3                      → ret_5d: -0.293%
...
D9                      → ret_5d: +0.915%
D10 highest lag_ret_20d → ret_5d: +1.358%   ← best returns

Almost perfectly stair-stepping up. That's what 7/9 monotonic means — 7 out of 9 consecutive steps went in the right direction.

Compare to momentum_strength which only got 4/9 — it was all over the place, stepping up then down then up. You can't rely on it to rank stocks.
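A score like "7/9" can be computed mechanically: bucket stocks into deciles by the signal, average the forward return per decile, and count how many of the nine decile-to-decile steps go up. This is an illustrative sketch, not the backtest's actual code:

```python
import numpy as np
import pandas as pd

def monotonicity_score(signal: pd.Series, fwd_ret: pd.Series, n_deciles: int = 10):
    """Count how many consecutive decile-to-decile steps move upward."""
    # Rank first so ties don't break qcut's equal-sized buckets
    deciles = pd.qcut(signal.rank(method="first"), n_deciles, labels=False)
    # Mean forward return per decile, ordered D1..D10
    mean_ret = fwd_ret.groupby(deciles).mean().sort_index()
    steps = np.diff(mean_ret.values)
    up = int((steps > 0).sum())  # e.g. 7 for a "7/9" signal
    return up, n_deciles - 1
```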


Which Features Were Actually Decile-Tested

Only three were tested in the backtest:

  • momentum_strength → 4/9 ❌ weak
  • lag_ret_20d → 7/9 ✅ strong
  • avg_pred (model output) → 5/9 ⚠️ moderate

That's it. The other features — sentiment, macro, volatility, trend_consistent — were never individually decile-tested. They only appear bundled together inside the model's avg_pred.


Should You Be Testing the Others?

Yes, if you want to know whether you're leaving alpha on the table.

The right way to do this is to run the decile test on every feature individually and see which ones have the cleanest monotonic separation. You'd be looking for features that score 6/9 or better. Then you test combinations of the top performers.

The candidates sitting untested right now:

Probably worth testing:

  • lag_ret_5d — you use it to construct momentum_strength but never tested it alone
  • volatility_5d — low volatility stocks might have cleaner signals
  • finbert_signed — raw sentiment score before residualization
  • finbert_signed_resid — sentiment surprise (sentiment unexplained by momentum)

Probably less useful alone but worth checking:

  • tone_signed and tone_signed_resid
  • vix_0d — same value for all stocks on a given day so won't help rank stocks against each other, but might tell you which days to trade
  • risk_adjusted_momentum — momentum normalized by volatility, might be cleaner than raw momentum_strength

What Was Added

A new --feature-tournament flag that runs a systematic decile test on every feature and prints a ranked leaderboard. When you run it:

  python backtest_signal_quality.py --feature-tournament

It tests all of these in one shot:

Price/momentum: lag_ret_20d, lag_ret_5d, momentum_strength, risk_adjusted_momentum, trend_consistent, volatility_5d

Sentiment (raw): finbert_signed, tone_signed

Sentiment (residual): finbert_signed_resid, tone_signed_resid

Macro: vix_0d, treasury_spread_0d, business_confidence_0d, consumer_sentiment_0d

For each feature it reports monotonicity score, Spearman correlation, and D10-D1 spread (best decile return minus worst decile return). Then it prints a clean leaderboard ranked by monotonicity, and runs head-to-head portfolio simulations for the top 3 features.
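The per-feature metrics can be sketched as follows (an illustrative reconstruction; the actual script may compute them differently, and the Spearman here uses pandas' built-in rank correlation):

```python
import numpy as np
import pandas as pd

def tournament_metrics(signal: pd.Series, fwd_ret: pd.Series, n_deciles: int = 10):
    """Spearman rank correlation and D10-D1 spread for one feature."""
    rho = signal.corr(fwd_ret, method="spearman")
    deciles = pd.qcut(signal.rank(method="first"), n_deciles, labels=False)
    mean_ret = fwd_ret.groupby(deciles).mean().sort_index()
    # Best decile's mean return minus worst decile's mean return
    spread = mean_ret.iloc[-1] - mean_ret.iloc[0]
    return rho, spread
```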

You can also combine it with --model-lift:

  python backtest_signal_quality.py --feature-tournament --model-lift

The output will tell you definitively whether lag_ret_20d is actually the best signal available, or whether something else deserves to be the ranking signal in production.


The Leaderboard Summary

Rank  Feature               Mono  Spearman                  Portfolio Sharpe
#1    volatility_5d         6/9   -0.014                    2.70
#2    lag_ret_20d           5/9   +0.120                    7.05
#3    lag_ret_5d            5/9   +0.055                    5.38
#4    finbert_signed_resid  5/9   +0.006 (not significant)
      Model (avg_pred)      5/9   +0.124                    2.88

The Volatility Problem

The tournament crowned volatility_5d as the best feature by monotonicity (6/9). But look at what actually happens when you trade it — Sharpe 2.70, worst day -6.77%, a +13.96% outlier day that saved the whole period. That's a lottery ticket strategy, not a signal. It's picking the most volatile stocks and occasionally getting lucky with a huge mover.

This exposes a flaw in using monotonicity as the sole ranking criterion. Monotonicity measures consistency of direction, not quality of risk-adjusted returns. Volatility ranked stocks happen to step up consistently across deciles, but the actual portfolio is chaotic and dangerous.

lag_ret_20d has lower monotonicity (5/9) but Sharpe 7.05 vs 2.70 — more than 2.5x better risk-adjusted performance. It's the right choice for a strategy you'd actually trade.


What the Tournament Confirms

Sentiment is weak. Raw finbert_signed and tone_signed score 2/9 — essentially noise as standalone signals. The residual versions do better (5/9 for finbert_signed_resid) but the Spearman correlation is tiny (+0.006) and statistically insignificant. Sentiment is not currently a useful standalone ranking signal.

Macro features are useless for cross-sectional ranking. VIX and treasury spread score 4/9 and business confidence/consumer sentiment score 0/9. This makes sense — they're the same number for every stock on a given day, so they can't tell you which stocks to pick. They might be useful as day-level filters ("don't trade on high VIX days") but that's a different analysis.

lag_ret_5d is a viable alternative to lag_ret_20d. Same monotonicity (5/9), similar cumulative return (+17.01% vs +17.01% — identical in this test period), but lower Sharpe (5.38 vs 7.05) and a much scarier worst day (-5.87% vs -2.40%). The 20-day window is smoother and more reliable.

The model (avg_pred) at 5/9 monotonicity is now in proper context. It ties with lag_ret_20d and lag_ret_5d on monotonicity, but produces Sharpe 2.88 vs 7.05. The model is combining features in a way that degrades the clean signal from lag_ret_20d rather than enhancing it.


Bottom Line

lag_ret_20d remains the right ranking signal — not because it won a clean tournament, but because it has the best combination of monotonicity, Spearman correlation (+0.12 and statistically significant), and actual portfolio performance. The tournament confirms there is no obvious better single feature hiding in the data that you were previously ignoring.

The one thing worth investigating further: could lag_ret_20d + finbert_signed_resid combined beat lag_ret_20d alone? Both score 5/9 and their Spearman correlations suggest they might be capturing different things. That would be a combined ranking signal test — a logical next step.

Wednesday, March 4, 2026

The News Sentiment Model is Not About News Sentiment

Quick post to discuss some things...

First, what I am discovering after training this model daily, and running predictions, is that the news sentiment is NOT what is influencing this model. 

The news sentiment is based on Transformer models for financial news (FinBERT). The news IS fresh: we get these articles and then predict over a 1-3 day period, before they're expired out to a training database where predicted returns are compared with actual returns.

News is NOT what this model is learning from. Instead, momentum and macro environment features are what is influencing the returns. Part of this could be the news being stale: the market already knows the news before the model runs. Also, we are only scoring headlines, not full articles, due to processing constraints. So there's that.

From the training output, here's the feature importance ranked across all three horizons:

Macro features (dominating ~50% of importance):

  • vix_0d — market fear index
  • treasury_spread_0d — yield curve spread
  • business_confidence_0d — FRED macro indicator
  • consumer_sentiment_0d — FRED macro indicator

Momentum features (~40%):

  • momentum_strength — short-term trend deceleration
  • risk_adjusted_momentum — momentum relative to volatility
  • trend_consistent — direction consistency

Sentiment features (~10%):

  • tone_signed_resid — residualized tone score
  • finbert_signed_resid — residualized FinBERT score
  • tone_signed — raw tone
  • finbert_signed — raw FinBERT

So the model is essentially saying: buy beaten-down stocks when the macro environment is calm. The news sentiment is contributing about 10% of the predictive signal, which is barely above noise given the R² of 0.002 on the residual models.

It's not really a news sentiment model at this point. It's a macro-regime mean-reversion model that happens to have sentiment features along for the ride.

Which raises the real question — is that actually a bad thing? Mean reversion in calm macro regimes is a legitimate strategy. The problem isn't the model's logic, it's that you only have 3 months of data so it hasn't seen enough regime diversity to be robust. And right now you're in a fear regime — elevated VIX, geopolitical uncertainty — which is exactly when this model historically underperforms.

Thursday, February 19, 2026

Fixing the Momentum Filter

 

The Problem

I noticed the model was not making ANY trades for over a week. Every day, the model generated 30+ buy signals, but the portfolio manager - which acts as a gatekeeper - blocked ALL of them due to negative momentum.

Root Cause Analysis

1. Stale momentum data:

  • Using articles from the last 10 days (now 3 days)
  • Momentum was calculated from prices 5-20 days old
  • By the time you made trading decisions, that momentum was ancient history

2. The momentum paradox:

  • momentum_strength = lag_ret_5d - lag_ret_20d
  • This measures deceleration, not absolute direction
  • Example: Stock up 20% over 20 days, then pulls back 4% in last 5 days = -24% momentum_strength
  • The filter was blocking strong stocks taking healthy pullbacks

The Backtest Evidence

Ran analysis on 31,600 test samples and found:

OLD filter (momentum_strength >= 0.1):

  • Blocked 89% of stocks
  • Blocked stocks: +0.64% avg return, 59.1% win rate ✅ BETTER
  • Allowed stocks: +0.23% avg return, 55.7% win rate ❌ WORSE

Best performing cohort (which we were blocking):

  • "Strong deceleration" (<-0.15 momentum)
  • Returns: +1.05% (1d), +2.58% (3d), +2.04% (5d)
  • Win rates: 61.6%, 67.4%, 63.4%

The filter was blocking the best opportunities.

The Solution

OLD: Only buy if momentum_strength > 0

NEW: Buy if EITHER:

  1. lag_ret_20d > 10% (strong 20-day uptrend), OR
  2. 0 < momentum_strength < 15% (mild positive momentum)

Why this works (at least according to the backtest):

  • Captures pullbacks in strong uptrends (mean reversion plays)
  • Captures steady risers (not overextended)
  • Blocks actual falling knives (negative long-term trend)
  • Blocks momentum chasers at the top (>15% recent momentum)
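The revised gate can be expressed as a tiny predicate. This is a sketch using the thresholds stated above, with returns as fractions (0.10 means 10%):

```python
def passes_momentum_filter(lag_ret_20d: float, momentum_strength: float) -> bool:
    """Revised gate: allow strong uptrends pulling back, or mild positive
    momentum; block falling knives and overextended momentum chasers."""
    strong_uptrend = lag_ret_20d > 0.10            # >10% over 20 days
    mild_momentum = 0 < momentum_strength < 0.15   # mild positive acceleration
    return strong_uptrend or mild_momentum
```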

Results

  • 28/30 signals now pass the filter (vs 0/30 before)
  • Deployed 9 trades today with the new logic
  • All are strong stocks pulling back (exactly what backtest said to buy)

Now we wait to see if these actually perform as the backtest predicted.

Wednesday, February 18, 2026

Residuals - How to Separate Two Inter-related Variables in a Model

Someone on an algotrading site suggested that I should try to extract how much of momentum was sentiment, and vice-versa.

I had to look into this. It took me down a road which, as you shall see, introduced me to the concept of Residuals. And that is what this blog post is about.

Chicken or egg?
Is sentiment causing returns, or just reacting to them? Raw scores conflate these.

Put another way, if news sentiment causes/impacts/influences Momentum, how much of the Momentum is due to the Sentiment? 

Well, there is a way to ferret this out. It is called Residuals.  

Step 1: Regress sentiment on momentum features:

finbert_raw ~ lag_ret_20d + lag_ret_5d + volatility_5d

Step 2: Residual = what's left after stripping out price explanation:

finbert_signed_resid = finbert_raw - predicted(finbert | momentum)

 

Now, you can use the residual as a feature in the model, instead of just tossing in the raw values. Residuals are the cleanest sentiment alpha you can extract. Raw scores are 90% noise.
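The two steps above amount to a plain least-squares fit followed by a subtraction. A minimal sketch (illustrative; the actual regression setup may differ):

```python
import numpy as np

def residualize(sentiment: np.ndarray, momentum_features: np.ndarray) -> np.ndarray:
    """Regress sentiment on momentum features (OLS with intercept) and
    return the residual: sentiment not explained by recent price action."""
    # Design matrix: intercept column plus the momentum features
    X = np.column_stack([np.ones(len(sentiment)), momentum_features])
    beta, *_ = np.linalg.lstsq(X, sentiment, rcond=None)
    return sentiment - X @ beta
```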

What does a residual capture?

✅ News sentiment BEFORE price moves (true alpha)
✅ Analyst upgrades/downgrades not reflected in price yet  
✅ Management guidance changes
✅ Product launch sentiment
❌ Earnings reactions (already in lag_ret_20d)


Tuesday, February 17, 2026

Trying to Right the Ship on my News Sentiment Based Stock Model

Jan 12th through Feb 3rd. A downward trend that saw my balance drop from $101K to $89K.

The reasons were very complex. I will discuss them below.

Bugs in the trading module

Bugs in the code that put stop limits in caused repetitive falling-knife buying scenarios.

OTC Stock Volatility

You can make a lot of money with OTC stocks, but you can also lose a lot. I removed the OTC exchanges and left just AMEX, NASDAQ and NYSE as the exchanges.

AI making changes to the algorithms I was not reviewing properly

AI had made several mistakes in moving the model from return prediction to a rank-based approach. 

In general, the consensus in discussion was that return prediction didn't make a ton of sense, and that ranking stocks based on their cumulative scores (news sentiment, momentum, and other derived features) made the most sense. The problem, though, was that the filtering was being done before the predictions, not AFTER. This meant the model only ever saw a restricted universe of data.

Article Mapping 

Phrase mapping had several bugs in it that caused completely legitimate articles to go unmapped. Recall that earlier, certain mapping bugs caused symbols to be attributed to casual words (e.g., "key").

I was mapping articles to symbols, and this created article-symbol-price tuples such that symbols with more news coverage generated many more rows of data than those with fewer articles. This was unbalancing the model. The fix was to aggregate to a symbol-day prediction, which collapses multiple articles per symbol per day into one row by averaging the sentiment features.
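The symbol-day aggregation boils down to a groupby-mean. A sketch with hypothetical column names:

```python
import pandas as pd

def aggregate_symbol_day(articles: pd.DataFrame, sentiment_cols: list) -> pd.DataFrame:
    """Collapse multiple articles per (symbol, date) into one row by
    averaging sentiment features, so news-heavy tickers don't dominate."""
    return (articles
            .groupby(["symbol", "date"], as_index=False)[sentiment_cols]
            .mean())
```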

There were other changes I made as well, such as using XGBoost instead of Random Forest. The balance has bounced back up to $97,760 at the time of this writing, so a decent recovery. Of course, the market has stabilized a bit, favoring Value right now. There was indeed a pullback market regime at the beginning of my slide, so that is a contributing factor as well.

So - will these changes work? 

We shall see.
 

 

Friday, January 16, 2026

Jan 12th - The Death Spiral - Trading Lessons Learned

I learned quickly all of the issues with Algo Trading.

Lesson 1 - 4-to-1 Margins Were Enabled

I didn't take the time to look at the Configuration. So I was trading on Margin, and had to Liquidate a bunch of positions, some of which lost money.

Lesson 2 - Stop Limits Can Lead to a HOST of Problems

I had stop limit orders being set up to protect my downside. But, this led to THIS problem:

Buy HMY @ $25 → gap to $23 → stop → rebuy $23.50 
→ gap to $21 → stop → rebuy $21.80 → PDT flag → can't exit → $18

Basically, the stop limits were causing repetitive buying in falling-knife scenarios, because the model would pick the same symbol again on the next trade.

Lesson 3 - Some Symbols Are Halted or Delisted

I had one position, CLCO, which was halted. I have tried to force liquidate this position, but it is absolutely stuck. I will probably have to contact support. But this has thrown a major wrench into my Portfolio Manager, which needs to know that this money is "locked up" and account for that.

I also had issues with the model itself - in that the way the algorithm was working, it would pick the best ranked stocks, even if the signal strength or momentum was negative. So this had to be fixed as well, with some enhanced signal strength and momentum logic.

So - very expensive lessons. The back-testing of the model is turning out fine. But the Portfolio Manager and its transaction management was killing us. We burned about $10K of our $100K on this. Hopefully it starts to work better now with fixes to these issues.
