The Main Backtest (what runs by default)
Uses ONLY lag_ret_20d and momentum_strength.
That's it. The cohort analysis, decile analysis, 2D grid, and portfolio simulations all just look at raw price features directly from the database. No sentiment, no macro, no model. It's purely:
- Filter: is momentum_strength < 0 and lag_ret_20d > 5%?
- Rank: sort by lag_ret_20d descending
- Pick top N
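In pandas terms, that default selection could be sketched as follows. The function name and the example data are illustrative; only the filter logic, the 5% threshold, and the descending sort come from the description above:

```python
import pandas as pd

def pick_top_n(df: pd.DataFrame, n: int = 5) -> pd.DataFrame:
    """Filter to candidates, then rank by 20-day lagged return."""
    # Filter: momentum_strength < 0 AND lag_ret_20d > 5%
    candidates = df[(df["momentum_strength"] < 0) & (df["lag_ret_20d"] > 0.05)]
    # Rank: sort by lag_ret_20d descending, pick top N
    return candidates.sort_values("lag_ret_20d", ascending=False).head(n)

# Illustrative cross-section (not real data)
stocks = pd.DataFrame({
    "ticker": ["AAA", "BBB", "CCC", "DDD"],
    "momentum_strength": [-0.2, -0.1, 0.3, -0.4],
    "lag_ret_20d": [0.08, 0.12, 0.20, 0.03],
})
print(pick_top_n(stocks, n=2)["ticker"].tolist())  # ['BBB', 'AAA']
```

CCC fails the momentum filter and DDD fails the 5% threshold, so only BBB and AAA survive, ranked by lag_ret_20d.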
The sentiment and macro features aren't even loaded in the default run. Look at the SQL query in load_from_db — it only pulls lag_ret_5d, lag_ret_20d, volatility_5d, and the return columns.
How lag_ret_20d Became the Ranking Signal
It wasn't chosen upfront. It emerged from the decile analysis in the backtest output. When you sort all stocks by lag_ret_20d from lowest to highest and look at actual forward returns across deciles, you get this clean pattern:
D1 (avg lag20d: -13%) → ret_5d: -1.201%
...
D9 (avg lag20d: +15%) → ret_5d: +0.915%
D10 (avg lag20d: +29%) → ret_5d: +1.358%
7/9 monotonic — returns increase almost perfectly as lag_ret_20d increases. That's the definition of a good ranking signal. The data told you: stocks that have been strong over 20 days continue to be strong over the next 5 days.
When you ran the same decile test on momentum_strength you got only 4/9 monotonic — weak. And when you ran it on avg_pred (the model output) you got 5/9 — moderate.
So the answer to "why lag_ret_20d" is simply: it produced the most monotonic decile separation on the holdout test set, which means it's the most reliable ranker of the three things tested.
What Hasn't Been Tested Yet
Here's the important caveat: you've only tested three ranking signals so far — momentum_strength, lag_ret_20d, and avg_pred. There could be better signals you haven't looked at yet, like volume, or lag_ret_20d combined with something else. But of the options tested, lag_ret_20d won clearly.
One caveat to that framing: the decile test was run on only a few features, not all of them. But first, it's worth pinning down what "monotonic" means, because it's the key concept.
Monotonic in Plain English
Imagine you rank all students in a class by how much they studied, from least to most. Then you look at their test scores. If the more-they-studied, the-better-they-scored — perfectly and consistently — that's monotonic. Every step up in studying corresponds to a step up in grade.
In the backtest, you're asking: if I sort stocks by signal X from lowest to highest, do returns also go from lowest to highest? If yes, the signal is monotonic and you can trust it to rank stocks reliably.
Here's what lag_ret_20d looked like:
D1 lowest lag_ret_20d → ret_5d: -1.201% ← worst returns
D2 → ret_5d: -0.544%
D3 → ret_5d: -0.293%
...
D9 → ret_5d: +0.915%
D10 highest lag_ret_20d → ret_5d: +1.358% ← best returns

Almost perfectly stair-stepping up. That's what 7/9 monotonic means — 7 out of 9 consecutive steps went in the right direction.
Compare to momentum_strength which only got 4/9 — it was all over the place, stepping up then down then up. You can't rely on it to rank stocks.
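The "7/9" counting can be made concrete in a few lines. The middle deciles D4-D8 below are illustrative placeholders (the post elides them); they include two small dips so the example reproduces the reported 7/9 score:

```python
def monotonic_steps(decile_returns):
    """Count how many consecutive decile-to-decile steps move upward.
    Ten deciles give nine steps, so a perfect ranker scores 9/9."""
    ups = sum(b > a for a, b in zip(decile_returns, decile_returns[1:]))
    return ups, len(decile_returns) - 1

# D1-D10 mean ret_5d (%) for lag_ret_20d; D4-D8 are made-up placeholders
lag20 = [-1.201, -0.544, -0.293, -0.350, 0.050, 0.200, 0.150, 0.600, 0.915, 1.358]
ups, steps = monotonic_steps(lag20)
print(f"{ups}/{steps} monotonic")  # 7/9 monotonic
```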
Which Features Were Actually Decile-Tested
Only three were tested in the backtest:
- momentum_strength → 4/9 ❌ weak
- lag_ret_20d → 7/9 ✅ strong
- avg_pred (model output) → 5/9 ⚠️ moderate
That's it. The other features — sentiment, macro, volatility, trend_consistent — were never individually decile-tested. They only appear bundled together inside the model's avg_pred.
Should You Be Testing the Others?
Yes, if you want to know whether you're leaving alpha on the table.
The right way to do this is to run the decile test on every feature individually and see which ones have the cleanest monotonic separation. You'd be looking for features that score 6/9 or better. Then you test combinations of the top performers.
The candidates sitting untested right now:
Probably worth testing:
- lag_ret_5d — you use it to construct momentum_strength but never tested it alone
- volatility_5d — low-volatility stocks might have cleaner signals
- finbert_signed — raw sentiment score before residualization
- finbert_signed_resid — sentiment surprise (sentiment unexplained by momentum)
Probably less useful alone but worth checking:
- tone_signed and tone_signed_resid
- vix_0d — same value for all stocks on a given day, so it won't help rank stocks against each other, but might tell you which days to trade
- risk_adjusted_momentum — momentum normalized by volatility, might be cleaner than raw momentum_strength
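The sweep described above (decile-test every candidate individually, shortlist anything scoring 6/9 or better) could be sketched like this on a synthetic cross-section. The function and data are illustrative, not the backtest script's actual code:

```python
import numpy as np
import pandas as pd

def decile_monotonicity(df, feature, ret_col="ret_5d"):
    """Bucket stocks into deciles of `feature` and count how many of the
    nine decile-to-decile steps in mean forward return go upward."""
    buckets = pd.qcut(df[feature].rank(method="first"), 10, labels=False)
    means = df.groupby(buckets)[ret_col].mean().values
    return int(np.sum(np.diff(means) > 0))

# Synthetic cross-section: one genuinely predictive signal, one pure noise
rng = np.random.default_rng(42)
n = 2000
df = pd.DataFrame({"good": rng.normal(size=n), "noise": rng.normal(size=n)})
df["ret_5d"] = 0.5 * df["good"] + rng.normal(size=n)

scores = {f: decile_monotonicity(df, f) for f in ["good", "noise"]}
shortlist = [f for f, s in scores.items() if s >= 6]  # the 6/9-or-better cut
print(scores, shortlist)
```

A real signal clears the 6/9 bar; pure noise should hover around the 4-5 steps expected by chance.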
What Was Added
A new --feature-tournament flag that runs a systematic decile test on every feature and prints a ranked leaderboard. When you run it:
python backtest_signal_quality.py --feature-tournament

It tests all of these in one shot:
Price/momentum: lag_ret_20d, lag_ret_5d, momentum_strength, risk_adjusted_momentum, trend_consistent, volatility_5d
Sentiment (raw): finbert_signed, tone_signed
Sentiment (residual): finbert_signed_resid, tone_signed_resid
Macro: vix_0d, treasury_spread_0d, business_confidence_0d, consumer_sentiment_0d
For each feature it reports monotonicity score, Spearman correlation, and D10-D1 spread (best decile return minus worst decile return). Then it prints a clean leaderboard ranked by monotonicity, and runs head-to-head portfolio simulations for the top 3 features.
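A sketch of those three per-feature metrics, computed with pandas on synthetic data (the actual script's internals aren't shown in the post, so this is an assumed implementation):

```python
import numpy as np
import pandas as pd

def feature_report(df, feature, ret_col="ret_5d"):
    """The three tournament metrics for one feature: monotonicity,
    Spearman rank correlation with forward returns, and D10-D1 spread."""
    buckets = pd.qcut(df[feature].rank(method="first"), 10, labels=False)
    decile_means = df.groupby(buckets)[ret_col].mean()
    return {
        "feature": feature,
        "mono": int(np.sum(np.diff(decile_means.values) > 0)),  # out of 9
        "spearman": df[feature].corr(df[ret_col], method="spearman"),
        "spread": decile_means.iloc[-1] - decile_means.iloc[0],  # D10 - D1
    }

rng = np.random.default_rng(7)
n = 2000
df = pd.DataFrame({"sig": rng.normal(size=n)})
df["ret_5d"] = 0.3 * df["sig"] + rng.normal(size=n)
print(feature_report(df, "sig"))
```

A leaderboard is then just these dicts sorted by the chosen criterion, e.g. `mono`.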
You can also combine it with --model-lift:
python backtest_signal_quality.py --feature-tournament --model-lift

The output will tell you definitively whether lag_ret_20d is actually the best signal available, or whether something else deserves to be the ranking signal in production.
The Leaderboard Summary
| Rank | Feature | Mono | Spearman | Portfolio Sharpe |
|---|---|---|---|---|
| #1 | volatility_5d | 6/9 | -0.014 | 2.70 |
| #2 | lag_ret_20d | 5/9 | +0.120 | 7.05 |
| #3 | lag_ret_5d | 5/9 | +0.055 | 5.38 |
| #4 | finbert_signed_resid | 5/9 | +0.006 (not significant) | — |
| — | avg_pred (model) | 5/9 | +0.124 | 2.88 |
The Volatility Problem
The tournament crowned volatility_5d as the best feature by monotonicity (6/9). But look at what actually happens when you trade it — Sharpe 2.70, worst day -6.77%, a +13.96% outlier day that saved the whole period. That's a lottery-ticket strategy, not a signal. It's picking the most volatile stocks and occasionally getting lucky with a huge mover.
This exposes a flaw in using monotonicity as the sole ranking criterion. Monotonicity measures consistency of direction, not quality of risk-adjusted returns. Volatility ranked stocks happen to step up consistently across deciles, but the actual portfolio is chaotic and dangerous.
lag_ret_20d has lower monotonicity (5/9) but Sharpe 7.05 vs 2.70 — more than 2.5x better risk-adjusted performance. It's the right choice for a strategy you'd actually trade.
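For reference, "Sharpe" here is presumably the annualized daily Sharpe ratio; a standard computation (the 252-day annualization factor and zero risk-free rate are assumptions) shows why the lottery-ticket pattern scores so poorly even when the period's total return looks fine:

```python
import numpy as np

def sharpe(daily_returns, periods_per_year=252):
    """Annualized Sharpe ratio of a daily return series (risk-free rate ~ 0)."""
    r = np.asarray(daily_returns)
    return r.mean() / r.std(ddof=1) * np.sqrt(periods_per_year)

# Two toy strategies with the SAME mean daily return over 20 days:
steady = np.array([0.002] * 19 + [0.001])            # small, consistent gains
lottery = np.array([-0.005] * 19 + [0.134])          # one huge day saves the period
print(round(sharpe(steady), 1), round(sharpe(lottery), 1))
```

Identical average return, wildly different Sharpe: the outlier-day strategy pays for its occasional jackpot with volatility on every other day.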
What the Tournament Confirms
Sentiment is weak. Raw finbert_signed and tone_signed score 2/9 — essentially noise as standalone signals. The residual versions do better (5/9 for finbert_signed_resid), but the Spearman correlation is tiny (+0.006) and statistically insignificant. Sentiment is not currently a useful standalone ranking signal.
Macro features are useless for cross-sectional ranking. VIX and treasury spread score 4/9 and business confidence/consumer sentiment score 0/9. This makes sense — they're the same number for every stock on a given day, so they can't tell you which stocks to pick. They might be useful as day-level filters ("don't trade on high VIX days") but that's a different analysis.
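That day-level filter idea could be sketched like this; the vix_0d threshold of 30 is a hypothetical value that would need its own backtest:

```python
import pandas as pd

VIX_MAX = 30  # hypothetical threshold, not tested in the tournament

def tradable_days(daily: pd.DataFrame) -> pd.Series:
    """Macro features like vix_0d are constant across stocks each day,
    so they can only gate WHEN to trade, not WHAT to buy."""
    return daily["vix_0d"] < VIX_MAX

days = pd.DataFrame({"date": ["2024-01-02", "2024-01-03"],
                     "vix_0d": [14.2, 35.8]})
print(days.loc[tradable_days(days), "date"].tolist())  # ['2024-01-02']
```

Only the calm day survives; stock selection on those days would still come from lag_ret_20d.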
lag_ret_5d is a viable alternative to lag_ret_20d. Same monotonicity (5/9) and an identical cumulative return in this test period (+17.01% for both), but lower Sharpe (5.38 vs 7.05) and a much scarier worst day (-5.87% vs -2.40%). The 20-day window is smoother and more reliable.
The model (avg_pred) at 5/9 monotonicity is now in proper context. It ties with lag_ret_20d and lag_ret_5d on monotonicity, but produces Sharpe 2.88 vs 7.05. The model is combining features in a way that degrades the clean signal from lag_ret_20d rather than enhancing it.
Bottom Line
lag_ret_20d remains the right ranking signal — not because it won a clean tournament, but because it has the best combination of monotonicity, Spearman correlation (+0.12 and statistically significant), and actual portfolio performance. The tournament confirms there is no obvious better single feature hiding in the data that you were previously ignoring.
The one thing worth investigating further: could lag_ret_20d + finbert_signed_resid combined beat lag_ret_20d alone? Both score 5/9, and their Spearman correlations suggest they might be capturing different things. That would be a combined ranking-signal test — a logical next step.
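A minimal version of that combined test could z-score each signal across the day's cross-section and rank by the sum. Equal weights and the example data are assumptions here; a real test would sweep weights and re-run the decile analysis on the combined score:

```python
import pandas as pd

def combined_rank_signal(df, features=("lag_ret_20d", "finbert_signed_resid")):
    """Z-score each feature across the cross-section and sum them,
    so signals on different scales contribute comparably to the rank."""
    z = pd.DataFrame({f: (df[f] - df[f].mean()) / df[f].std(ddof=0)
                      for f in features})
    return z.sum(axis=1)

# Illustrative three-stock cross-section
df = pd.DataFrame({
    "ticker": ["AAA", "BBB", "CCC"],
    "lag_ret_20d": [0.10, 0.25, -0.05],
    "finbert_signed_resid": [0.8, -0.2, 0.1],
})
df["combo"] = combined_rank_signal(df)
print(df.sort_values("combo", ascending=False)["ticker"].tolist())  # ['AAA', 'BBB', 'CCC']
```

Note how AAA, middling on momentum alone, moves to the top once its strong sentiment surprise is added — exactly the "capturing different things" effect the combined test would look for.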