Status: Not shipped. Negative result (2026-05-29). The hand-set
α = 0.05 offset (Model 16) beats a per-fold fitted α on median Brier.
The shipped offset stays as-is.
Backtest date: 2026-05-29
Reproducer: scripts/backtest_composite_fitted_alpha.py --today 2026-05-29
Code: scripts/backtest_composite_fitted_alpha.py (new)
Tests: tests/test_composite_fitted_alpha.py
Sidecar output: data/wc2026/composite_fitted_alpha_backtest.json (gitignored)
Hypothesis
The player-composite differential is the lineup-aware team-strength
signal the improvements roadmap §2.3
flags as the highest-leverage item in the literature. Today it enters
the model only as a hand-set multiplicative log-rate offset
("Model 16"): for a fixture, Δ = α · (composite_home − composite_away)
is added to home log-λ (and subtracted from away log-λ), with the
single coefficient α = 0.05 chosen by the coarse 5-point grid
{0, 0.005, 0.01, 0.02, 0.05} in scripts/backtest_composite_offset.py.
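
As a minimal sketch of how such an offset enters, assuming the fitted model exposes per-team log goal rates. The helper name, its arguments and the toy composite values are illustrative assumptions, not the signatures or data used by predict_match:

```python
import math

def apply_composite_offset(log_lam_home, log_lam_away,
                           composite_home, composite_away, alpha=0.05):
    """Shift both log goal rates by the composite differential.

    Hypothetical helper for illustration only; the shipped logic lives
    in predict_match and is not reproduced here.
    """
    delta = alpha * (composite_home - composite_away)
    # Added to the home log-rate, subtracted from the away log-rate.
    return log_lam_home + delta, log_lam_away - delta

# Toy fixture (made-up numbers): equal base rates, home composite 2.0 higher.
home, away = apply_composite_offset(math.log(1.4), math.log(1.4), 3.0, 1.0)
print(round(math.exp(home), 3), round(math.exp(away), 3))
```

Because the shift is additive in log space, the home rate is multiplied by exp(Δ) and the away rate by exp(−Δ); with α = 0 both rates are left unchanged.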
Two weaknesses motivated re-fitting α as a first-class feature:
- The grid is right-censored. 0.05 is the largest candidate and the grid's Brier improvement increases monotonically in α, so the true optimum could lie off-grid, above 0.05. A continuous fit reveals whether it does.
- One global α, mild look-ahead. The grid picks a single α by aggregating across all eight walks, so each walk's chosen α "sees" the others. A strictly in-fold fit (validation slice pre-cutoff only, evaluate on the post-cutoff holdout) is the honest test of whether the composite differential generalises as a fitted feature.
Design — why a single continuous α (lowest capacity)
The prior art on this codebase is unambiguous about capacity at this sample size:
- The ~20-covariate gradient-boosting meta-learner (fit_ensemble_meta.py) overfit and lost badly (0.533 vs 0.503).
- Even the 3-parameter Bayesian stacker (fit_stacking_weights.py) tied uniform and lost by ~1bp on median Brier.
So this experiment fits the single lowest-capacity parametric form
available: one continuous scalar α, refit per fold by 1-D Brier
minimisation (scipy.optimize.minimize_scalar, bounded [0, 0.30]).
This is strictly lower capacity than the rejected alternatives, and
lower than the two other variants considered:
- Tier-conditional α (separate coefficient per tournament tier) was rejected up front: the composite offset only fires on ≈229 holdout matches across 8 walks (see "The binding constraint" below), so splitting by tier would leave a handful of matches per tier per fold — guaranteed overfit.
- A logistic/ordinal-logit blend term adds a second mixing weight and another calibration stage for no structural gain over the offset that is already wired into predict_match.
Method
Per walk (8 folds × 90-day eval windows, identical to the existing backtest harness):
- Refit DC + HP on data strictly before the walk's cutoff (backtest_models.fit_models_pre_cutoff).
- Fit α in-fold. Carve the 180-day window immediately before the cutoff as a validation slice (still pre-cutoff, so no look-ahead into the eval window). Minimise the DC+HP uniform-mean 3-class Brier on that slice's composite-covered matches over α ∈ [0, 0.30]. If the slice has < 8 composite-covered matches, fall back to α = 0.05 (so a thin fold can't emit a wild α; in this run no fold fell back).
- Evaluate three coefficient policies on the post-cutoff holdout, all on the same DC/HP fit and the same eval rows: α = 0 (offset off), α = 0.05 (the shipped incumbent), α = α_fit.
Metrics are on the DC+HP uniform mean, with Elo excluded. DC+HP is the
offset's natural target and matches backtest_composite_offset.py, so the
small offset effect isn't diluted by an unrelated component.
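
The in-fold fit with its thin-slice fallback can be sketched as below. This is a standard-library sketch under stated assumptions, not the script's code: independent Poisson scores stand in for the DC+HP mean, the 3-class Brier is assumed to be the sum of squared errors over home/draw/away, a hand-rolled golden-section search stands in for scipy.optimize.minimize_scalar, and all function names and the match tuple layout are hypothetical.

```python
import math

ALPHA_BOUNDS = (0.0, 0.30)   # search interval from the write-up
FALLBACK_ALPHA = 0.05        # used when the validation slice is too thin
MIN_COVERED = 8              # minimum composite-covered validation matches

def outcome_probs(lam_home, lam_away, max_goals=10):
    """(P(home win), P(draw), P(away win)) under independent Poisson scores.

    ASSUMPTION: a stand-in for the DC+HP uniform mean.
    """
    def pmf(lam):
        return [math.exp(-lam) * lam ** k / math.factorial(k)
                for k in range(max_goals + 1)]
    ph, pa = pmf(lam_home), pmf(lam_away)
    n = max_goals + 1
    home = sum(ph[i] * pa[j] for i in range(n) for j in range(i))
    draw = sum(ph[i] * pa[i] for i in range(n))
    away = sum(ph[i] * pa[j] for i in range(n) for j in range(i + 1, n))
    total = home + draw + away  # renormalise after truncating at max_goals
    return home / total, draw / total, away / total

def slice_brier(alpha, matches):
    """Mean 3-class Brier over matches.

    Each match is (log_lam_home, log_lam_away, composite_diff, outcome),
    with outcome 0 = home win, 1 = draw, 2 = away win (hypothetical layout).
    """
    score = 0.0
    for log_h, log_a, diff, outcome in matches:
        delta = alpha * diff
        probs = outcome_probs(math.exp(log_h + delta), math.exp(log_a - delta))
        score += sum((p - (1.0 if k == outcome else 0.0)) ** 2
                     for k, p in enumerate(probs))
    return score / len(matches)

def golden_section_min(f, lo, hi, tol=1e-5):
    """Bounded 1-D minimiser (golden-section search)."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    c, d = b - inv_phi * (b - a), a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc < fd:            # minimum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:                  # minimum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
    return (a + b) / 2

def fit_alpha_in_fold(validation_matches):
    """Fit alpha on the pre-cutoff slice, or fall back if it is too thin."""
    if len(validation_matches) < MIN_COVERED:
        return FALLBACK_ALPHA
    return golden_section_min(lambda a: slice_brier(a, validation_matches),
                              *ALPHA_BOUNDS)
```

With SciPy available, the search step would instead be a call such as minimize_scalar(objective, bounds=(0.0, 0.30), method="bounded"). On a synthetic slice where the higher-composite side always wins, the fit runs to the upper bound; where it always loses, the fit parks at the lower bound.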
Harness fidelity check. The α = 0 and α = 0.05 baselines computed by
this script match backtest_composite_offset.py bit-for-bit
(0.514023 and 0.511581 median Brier respectively), confirming no
methodology drift — the three-way comparison is apples-to-apples.
The binding constraint: the offset fires on only ~12% of matches
The composite covers only the 48 WC2026 qualifiers. The offset fires only when both teams carry a composite. Across the 8-walk holdout that is 229 of 1,970 matches (11.6%) — and one walk has only 2 affected matches:
| Walk eval window | n_holdout | both-WC (offset fires) |
|---|---|---|
| 2024-06-08 → 2024-09-06 | 265 | 44 |
| 2024-09-06 → 2024-12-05 | 485 | 42 |
| 2024-12-05 → 2025-03-05 | 66 | 2 |
| 2025-03-05 → 2025-06-03 | 202 | 16 |
| 2025-06-03 → 2025-09-01 | 214 | 31 |
| 2025-09-01 → 2025-11-30 | 502 | 49 |
| 2025-11-30 → 2026-02-28 | 89 | 14 |
| 2026-02-28 → 2026-05-29 | 147 | 31 |
On the other ~88% of matches the offset is a literal no-op, so every α produces identical predictions. The validation slice that drives the in-fold fit is similarly thin (18–86 covered matches). At that effective n the in-fold Brier surface in α is nearly flat and dominated by noise.
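
The coverage figures can be checked directly from the table. The snippet below only re-adds the numbers listed there; the variable names are illustrative:

```python
# (eval window, n_holdout, both-WC matches) copied from the table above.
WALKS = [
    ("2024-06-08 → 2024-09-06", 265, 44),
    ("2024-09-06 → 2024-12-05", 485, 42),
    ("2024-12-05 → 2025-03-05", 66, 2),
    ("2025-03-05 → 2025-06-03", 202, 16),
    ("2025-06-03 → 2025-09-01", 214, 31),
    ("2025-09-01 → 2025-11-30", 502, 49),
    ("2025-11-30 → 2026-02-28", 89, 14),
    ("2026-02-28 → 2026-05-29", 147, 31),
]

n_holdout = sum(n for _, n, _ in WALKS)
n_fires = sum(k for _, _, k in WALKS)
coverage = n_fires / n_holdout

print(n_fires, n_holdout, f"{coverage:.1%}")  # 229 1970 11.6%

# Per-walk share of holdout matches on which the offset fires.
for window, n, k in WALKS:
    print(f"{window}: {k}/{n} = {k / n:.1%}")
```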
Result — gate FAILED
Three-way comparison, DC+HP uniform mean, median across 8 walks:
| Policy | median Brier | mean log-loss | mean ECE |
|---|---|---|---|
| α = 0 (offset off) | 0.514023 | 0.860648 | 8.24 pp |
| α = 0.05 (incumbent) | 0.511581 | 0.860347 | 8.60 pp |
| α = fitted (per-fold) | 0.513901 | 0.860791 | 8.11 pp |
Gate (fitted vs the INCUMBENT α = 0.05): median Brier 0.513901 vs 0.511581. The fitted variant is 23.2 bp WORSE (1 bp = 0.0001 Brier), so the strictly-lower-Brier half of the conjunction fails outright. (The fitted variant's median ECE is better, 6.93pp vs 7.54pp, as is its mean ECE in the table above, but that's moot once Brier fails.) GATE FAILED.
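
For illustration, a two-condition gate of this kind might look like the sketch below. It is not the code behind metrics.apply_conjunction_gate: the function name and return shape are hypothetical, and the second condition (ECE no worse than the incumbent) is an assumption, since only the strictly-lower-Brier half is spelled out here.

```python
def conjunction_gate(cand_brier, inc_brier, cand_ece, inc_ece):
    """Hypothetical gate: pass only if BOTH conditions hold."""
    brier_ok = cand_brier < inc_brier   # strictly lower median Brier
    ece_ok = cand_ece <= inc_ece        # ASSUMED second half: ECE not worse
    return {"brier_ok": brier_ok, "ece_ok": ece_ok,
            "passed": brier_ok and ece_ok}

# Median figures quoted above: fitted alpha vs the incumbent alpha = 0.05.
result = conjunction_gate(0.513901, 0.511581, 6.93, 7.54)
gap_bp = (0.513901 - 0.511581) * 1e4   # Brier gap in basis points

print(result["passed"], round(gap_bp, 1))
```

Under these assumptions the ECE condition holds but the Brier condition does not, so the conjunction fails regardless of calibration.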
Per-fold fitted α and the eval-window Brier each policy produced:
| Walk | n_val_cov | α_fit | Brier α=0 | Brier α=0.05 | Brier α_fit |
|---|---|---|---|---|---|
| 2024-06→09 | 49 | 0.0001 | 0.526980 | 0.526118 | 0.526978 |
| 2024-09→12 | 67 | 0.0270 | 0.517776 | 0.517576 | 0.517537 |
| 2024-12→03 | 86 | 0.0405 | 0.566370 | 0.566305 | 0.566288 |
| 2025-03→06 | 44 | 0.0291 | 0.478713 | 0.481509 | 0.480284 |
| 2025-06→09 | 18 | 0.0001 | 0.455950 | 0.457612 | 0.455951 |
| 2025-09→11 | 47 | 0.0001 | 0.468151 | 0.469494 | 0.468152 |
| 2025-11→02 | 80 | 0.0001 | 0.510271 | 0.505586 | 0.510265 |
| 2026-02→05 | 63 | 0.0028 | 0.525971 | 0.524360 | 0.525849 |
Median fitted α = 0.0014 (vs shipped 0.05).
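
The headline medians can be recomputed from the per-fold table. The snippet uses only the rounded values printed there, so the last digit can differ slightly from figures computed on unrounded data; variable names are illustrative:

```python
from statistics import median

# (alpha_fit, Brier α=0, Brier α=0.05, Brier α_fit) per walk, from the table above.
FOLDS = [
    (0.0001, 0.526980, 0.526118, 0.526978),
    (0.0270, 0.517776, 0.517576, 0.517537),
    (0.0405, 0.566370, 0.566305, 0.566288),
    (0.0291, 0.478713, 0.481509, 0.480284),
    (0.0001, 0.455950, 0.457612, 0.455951),
    (0.0001, 0.468151, 0.469494, 0.468152),
    (0.0001, 0.510271, 0.505586, 0.510265),
    (0.0028, 0.525971, 0.524360, 0.525849),
]

alpha_fit, brier_off, brier_inc, brier_fit = zip(*FOLDS)

med_off = median(brier_off)
med_inc = median(brier_inc)
med_fit = median(brier_fit)
gap_bp = (med_fit - med_inc) * 1e4   # fitted minus incumbent, in basis points
med_alpha = median(alpha_fit)

print(f"median Brier: off={med_off:.6f} inc={med_inc:.6f} fit={med_fit:.6f}")
print(f"fitted vs incumbent: {gap_bp:+.1f} bp, median alpha_fit={med_alpha:.5f}")
```

From the rounded table entries the median fitted α comes out at 0.00145, in line with the reported 0.0014.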
Why the hand-set prior beats the honest fit
The fitted α collapses toward zero on 5 of 8 folds (four at α ≈ 0.0001, one at 0.0028) and only moves meaningfully on 3 folds (0.027, 0.041, 0.029). This is the classic sparse-signal failure mode:
- The validation and eval populations differ. The in-fold optimiser minimises Brier on the recent 180-day slice; the eval window is the next 90 days, a partly different set of teams/fixtures. With only dozens of covered matches on each side, the α that's best on the validation slice is a noisy estimate of the α that's best on the eval window.
- The Brier-vs-α surface is nearly flat. Because the offset moves only ~12% of matches by a small log-rate amount, the in-fold objective barely changes with α — so the optimiser frequently parks at the lower bound (α ≈ 0).
- The fixed 0.05 is a stronger, pooled prior. The grid that chose 0.05 effectively pooled the signal across all eight walks. That pooling is exactly the regularisation the per-fold fit gives up. At this sample size, the pooled prior generalises better than fitting freely in each fold — even though pooling is the mild look-ahead we set out to remove.
The "right-censored grid" hypothesis is at most partially confirmed: three folds want a clearly non-zero α (0.0405, 0.0291, 0.0270), but no fold fits α above 0.05, and the signal is far too noisy fold-to-fold for a continuous per-fold fit to capitalise on it. The direction is real (every non-zero α improves the pooled grid Brier); the per-fold fittability is not there.
Decision
No ship. Keep the hand-set α = 0.05 offset (Model 16). A free per-fold fit underperforms it because the composite signal is too sparse (≈12% of matches, dozens per fold) to fit honestly in-fold; the pooled grid's 0.05 is the better-regularised choice at this n.
The real lever is coverage, not coefficient form: the offset will only ever touch WC-vs-WC fixtures until the player composite is extended beyond the 48 qualifiers. Re-running this experiment is worthwhile only after the composite's match coverage materially increases (e.g. club competitions, or a broader international player pool) — at which point a fitted or even tier-conditional α may become honestly estimable. Until then the grid-tuned scalar is the right tool.
Caveats
- Negative result, but a clean one. The harness reproduces the incumbent baselines exactly, the fit is strictly in-fold (no look-ahead), and the gate is the project-standard conjunction routed through metrics.apply_conjunction_gate. The conclusion is not an artefact of the experiment design.
- No production artefact was mutated. This experiment only writes a gitignored sidecar; dixon_coles.json / hierarchical_poisson.json retain their shipped composite_alpha = 0.05.
- Metrics are on the DC+HP mean, not the full Elo+DC+HP ensemble. Folded into the full ensemble (Elo diluting the offset), the effect is smaller still, consistent with the calibration-plateau diagnosis in ds-models-plan.md.