Status: Not shipped. See decision gate at the bottom.
Backtest date: 2026-05-27
Reproducer: scripts/backtest_state_space_dc.py --folds 8 --window-days 90 --h-grid 180 360 720 1440 2880 --today 2026-05-25 --fast
Persisted output: data/wc2026/state_space_dc_gate.json (gitignored; regenerable per host)
Design note: documentation/research-notes/state-space-dc-design.md
Hypothesis
Per the design note (variant a, "EMA on (α_t, β_t)"): each team's attack/defence parameters should EVOLVE through time rather than absorb every era's matches into a single stationary compromise. Refit DC at K snapshot timestamps (= the 8 quarterly walk cutoffs), blend each team's parameters across snapshots via an exponential moving average with tunable half-life h_param, and check whether the blended parameter trajectory beats the stationary baseline on walk-forward Brier / ECE.
Reasoning:
- Argentina 2018 → 2022, Germany 2014 → 2018, Spain post-2018 etc. exhibit material within-team drift that the stationary fit averages across.
- The shipping match-likelihood time-decay (5-year half-life on the baseline DC fit) attenuates old matches but doesn't tell the model WHEN a regime ended. It pools across regimes.
- An EMA over per-snapshot MLE refits is the cheapest variant to test the hypothesis: ~50 lines of code on top of the existing fit, K refits per gate.
Implementation summary
- `scripts/fit_dixon_coles_state_space.py` refits DC at each snapshot timestamp on the FULL pre-cutoff corpus with uniform match weighting (match-likelihood half-life dropped per design §3.7: "let parameter evolution be the only decay mechanism"), then blends per-team parameters across snapshots via `α_t^(k) = (1−w_k) · α_t^(k−1) + w_k · α_t^(MLE,k)` with `w_k = 1 − 2^(−Δt_k/h_param)`.
- Sum-to-zero on (α, β) re-enforced per snapshot (design §3.1).
- `home_advantage` and `ρ` held FIXED across snapshots in v0 (design §3.2), taken from the most recent snapshot's MLE.
- Artefact format: `data/wc2026/dixon_coles_state_space.json` carries the full per-snapshot trajectory PLUS a legacy `attack`/`defense`/... block at the root pointing at the latest snapshot, so downstream callers that only read those keys would work unchanged (design §3.6, option ii).
- Gate runner: `scripts/backtest_state_space_dc.py` sweeps the h_param grid and writes the per-h_param table to `data/wc2026/state_space_dc_gate.json`. Per-snapshot and per-walk MLE fits are cached, so the full gate runs ~16 unique L-BFGS-B refits rather than O(walks × h_grid_size).
- A `--fast` mode loosens the L-BFGS-B tolerances (gtol=1e-5, ftol=1e-7, maxfun=30k). The gate run that produced these numbers used `--fast` to fit inside a 3-hour wall-clock budget. Spot-checked with a synthetic ablation against a single full-precision fit (gtol=1e-7): the relative ordering of state-space-vs-baseline doesn't flip, and individual Brier differs by ~1bp.
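The blend rule in the first bullet can be sketched in a few lines. This is a minimal illustration, not the code in `scripts/fit_dixon_coles_state_space.py`: the function names are hypothetical, the per-snapshot MLE values are taken as given, and every snapshot is assumed to contain the same set of teams (the sparse-team inheritance rule is left out).

```python
from typing import Dict, List


def ema_weight(dt_days: float, h_param_days: float) -> float:
    """w_k = 1 - 2^(-dt_k / h_param): share given to the new snapshot's MLE."""
    return 1.0 - 2.0 ** (-dt_days / h_param_days)


def recentre(params: Dict[str, float]) -> Dict[str, float]:
    """Re-impose the sum-to-zero constraint by subtracting the mean."""
    mean = sum(params.values()) / len(params)
    return {team: value - mean for team, value in params.items()}


def blend_snapshots(
    mle_by_snapshot: List[Dict[str, float]],
    snapshot_days: List[float],
    h_param_days: float,
) -> List[Dict[str, float]]:
    """Blend per-team parameters across snapshots with an EMA.

    Assumption: all snapshots share the same team keys.
    """
    trajectory = [recentre(mle_by_snapshot[0])]
    for k in range(1, len(mle_by_snapshot)):
        w = ema_weight(snapshot_days[k] - snapshot_days[k - 1], h_param_days)
        prev = trajectory[-1]
        blended = {
            team: (1.0 - w) * prev[team] + w * mle_by_snapshot[k][team]
            for team in prev
        }
        trajectory.append(recentre(blended))
    return trajectory


# Toy example: two teams, two snapshots spaced exactly one half-life apart,
# so the new MLE receives weight 0.5.
toy = blend_snapshots(
    [{"A": 0.2, "B": -0.2}, {"A": 0.6, "B": -0.6}],
    snapshot_days=[0.0, 180.0],
    h_param_days=180.0,
)
```

When the snapshot spacing equals `h_param`, the weight is exactly one half, so the blended value lands midway between the previous blend and the new MLE. With 90-day spacing and a 2880-day half-life the new snapshot gets only about 2% weight, which is why the large-`h_param` settings behave like a near-stationary fit.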
Backtest setup
| Field | Value |
|---|---|
| Walks | 8 |
| Window per walk | 90 days |
| Most recent walk | 2026-02-24 → 2026-05-25 |
| Earliest walk | 2024-06-04 → 2024-09-02 |
| Per-walk training | matches strictly before walk's fit_until |
| Per-snapshot fit | full pre-cutoff corpus with uniform match weights (half_life_days = 1e9), 10-year window, min 20 matches/team |
| h_param grid | {180d, 360d, 720d, 1440d, 2880d} (6mo, 1y, 2y, 4y, 8y) |
| Baseline | shipping stationary DC (scripts/fit_dixon_coles.py, 5y half-life on matches, same window + min-matches) |
| Acceptance gate | conjunction: median Brier strictly lower than baseline AND median ECE within +0.2pp, evaluated on BOTH raw AND isotonic-calibrated metrics |
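
The walk boundaries in the table are consistent with eight contiguous 90-day windows counted back from `--today`. A minimal sketch under that assumption (the helper name `walk_windows` is hypothetical and not taken from the gate runner):

```python
from datetime import date, timedelta
from typing import List, Tuple


def walk_windows(today: date, folds: int, window_days: int) -> List[Tuple[date, date]]:
    """Return (start, end) pairs, most recent first, assuming back-to-back windows."""
    windows = []
    end = today
    for _ in range(folds):
        start = end - timedelta(days=window_days)
        windows.append((start, end))
        end = start  # the next (older) walk ends where this one starts
    return windows


walks = walk_windows(date(2026, 5, 25), folds=8, window_days=90)
most_recent, earliest = walks[0], walks[-1]
```

Under these assumptions `most_recent` is 2026-02-24 → 2026-05-25 and `earliest` is 2024-06-04 → 2024-09-02, matching the two rows above.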
Result — gate fails on every h_param
Median across 8 walks (lower Brier and lower ECE are better):
| Setting | raw Brier | raw ECE | cal Brier | cal ECE | ΔBrier(raw) | ΔBrier(cal) | ΔECE(raw) | ΔECE(cal) | Gate |
|---|---|---|---|---|---|---|---|---|---|
| Baseline DC | 0.50376 | 6.42pp | 0.50651 | 8.34pp | — | — | — | — | — |
| State-space 180d | 0.50689 | 7.53pp | 0.51191 | 9.77pp | +31.4bp | +54.0bp | +1.11pp | +1.42pp | fail |
| State-space 360d | 0.50801 | 7.88pp | 0.51387 | 9.62pp | +42.6bp | +73.6bp | +1.45pp | +1.27pp | fail |
| State-space 720d | 0.50913 | 7.76pp | 0.51599 | 9.14pp | +53.7bp | +94.7bp | +1.33pp | +0.80pp | fail |
| State-space 1440d | 0.50994 | 6.67pp | 0.51715 | 8.38pp | +61.8bp | +106.4bp | +0.25pp | +0.04pp | fail |
| State-space 2880d | 0.51043 | 6.92pp | 0.51782 | 8.27pp | +66.7bp | +113.1bp | +0.50pp | −0.07pp | fail |
Every h_param degrades Brier on both raw and calibrated metrics. The smallest-h_param variants (180d, 360d), closest to "fully per-walk MLE", degrade Brier the LEAST (raw ΔBrier ≈ +31-43bp) but degrade ECE the MOST (calibrated ECE +1.27-1.42pp). The largest-h_param variants (1440d, 2880d), closest to a long-window stationary fit, preserve calibrated ECE (within ±0.1pp of baseline) but degrade Brier even more (raw ΔBrier ≈ +62-67bp), and their raw ECE (+0.25pp, +0.50pp) still exceeds the +0.2pp tolerance. Brier worsens monotonically as h_param grows while calibrated ECE improves monotonically (raw ECE improves overall but not strictly: it peaks at 360d and is lowest at 1440d); no setting splits the difference.
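
The acceptance gate from the setup table can be applied mechanically to the medians above. The sketch below is illustrative only: the `Medians` container and `passes_gate` helper are hypothetical names, and reading "within +0.2pp" as an inclusive bound is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Medians:
    raw_brier: float
    raw_ece_pp: float
    cal_brier: float
    cal_ece_pp: float


ECE_TOLERANCE_PP = 0.2  # from the acceptance-gate row; inclusive bound assumed


def passes_gate(candidate: Medians, baseline: Medians) -> bool:
    """Conjunction gate: strictly lower Brier AND ECE within tolerance, raw AND calibrated."""
    brier_ok = (
        candidate.raw_brier < baseline.raw_brier
        and candidate.cal_brier < baseline.cal_brier
    )
    ece_ok = (
        candidate.raw_ece_pp - baseline.raw_ece_pp <= ECE_TOLERANCE_PP
        and candidate.cal_ece_pp - baseline.cal_ece_pp <= ECE_TOLERANCE_PP
    )
    return brier_ok and ece_ok


# Median values copied from the results table.
BASELINE = Medians(0.50376, 6.42, 0.50651, 8.34)
GRID = {
    180: Medians(0.50689, 7.53, 0.51191, 9.77),
    360: Medians(0.50801, 7.88, 0.51387, 9.62),
    720: Medians(0.50913, 7.76, 0.51599, 9.14),
    1440: Medians(0.50994, 6.67, 0.51715, 8.38),
    2880: Medians(0.51043, 6.92, 0.51782, 8.27),
}

gate_results = {h: passes_gate(m, BASELINE) for h, m in GRID.items()}
```

Every entry of `gate_results` is `False`: no setting clears the Brier half of the conjunction, so the ECE half never gets a chance to matter.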
The per-walk picture is more textured (data/wc2026/state_space_dc_gate.json carries the full per-walk table). State-space matches or modestly beats baseline on Brier in walks 1 and 3 (the Brier delta is negative at 180d for walks 1, 3) but loses by 30-180bp on walks 2, 4-8. ECE is similarly mixed but the conjunction never aligns.
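
For readers unfamiliar with the two metrics, the sketch below shows one common way to compute them for three-outcome (home/draw/away) forecasts. The exact definitions used by the gate runner are not stated in this note, so both the sum-over-outcomes Brier form and the top-label, 10-bin ECE are assumptions.

```python
from typing import List, Sequence


def multiclass_brier(probs: Sequence[Sequence[float]], outcomes: Sequence[int]) -> float:
    """Mean over fixtures of the squared error summed across outcome classes."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        total += sum((p_c - (1.0 if c == y else 0.0)) ** 2 for c, p_c in enumerate(p))
    return total / len(probs)


def top_label_ece(
    probs: Sequence[Sequence[float]], outcomes: Sequence[int], n_bins: int = 10
) -> float:
    """Expected calibration error on the most likely class, equal-width bins."""
    bins: List[List[tuple]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        p = list(p)
        conf = max(p)
        pred = p.index(conf)
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, 1.0 if pred == y else 0.0))
    n = len(probs)
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            accuracy = sum(hit for _, hit in b) / len(b)
            ece += len(b) / n * abs(avg_conf - accuracy)
    return ece


# A uniform forecast scores 2/3 on this Brier definition regardless of outcome.
uniform_brier = multiclass_brier([[1 / 3, 1 / 3, 1 / 3]], [0])
```

On this definition a uniform forecast scores 2/3 and a perfect one scores 0, so medians near 0.50 sit in the range expected for informative but far from certain three-way forecasts.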
Why the negative result is plausible
- The intl corpus is sparse enough that per-walk MLE adds noise faster than it adds signal. The design's risk §1 ("overfitting on per-team trends") and §2 ("sparse-team behaviour") are both real. With ~22 matches/team/walk on average (comfortable for major federations, far too few for the tail), the per-walk MLE wobbles team parameters in ways the EMA only partially absorbs. The rest-day-ablation precedent applies again: the DC parameters had already absorbed the effect through team identity.
- Dropping the match-likelihood time-decay throws away signal that the EMA doesn't recover. The shipping baseline's 5-year half-life on matches IS a form of temporal weighting (matches from 5y ago count ~half as much as fresh). The state-space variant trades this match-level decay for a snapshot-level decay, but snapshots are coarse (90d granularity) and the team-level smoothing is per-snapshot, not per-match. The effective amount of "old data emphasis" the state-space variant gives is less smooth than the baseline's exponential decay, even at high h_param. The largest h_param values (1440d, 2880d) approach the long-window-MLE-with-uniform-weights endpoint, which loses to long-window-MLE-with-5y-decay on Brier directly.
- Calibrator drag is real. The isotonic calibrator was fit against the stationary DC's ensemble outputs. The state-space variant produces a slightly different output distribution per fixture and the calibrator over-corrects toward the stationary curve, which is part of why calibrated metrics degrade MORE than raw metrics for the small-h_param variants. A re-fit calibrator could close part of this gap, but the raw-Brier degradation (+31bp at best) is far above the design's "0.5bp distinguishable-from-noise floor", so re-fitting the calibrator on a worse-Brier component is unlikely to flip the verdict.
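
The contrast between match-level and snapshot-level decay in the second point can be made concrete. The sketch below uses hypothetical helper names, and the EMA coefficients ignore the per-snapshot sum-to-zero recentring. It computes the baseline's per-match weight and, by unrolling the EMA recursion, the coefficient each snapshot's MLE carries in the final blend.

```python
from typing import List

FIVE_YEARS_DAYS = 5 * 365.25


def match_weight(age_days: float, half_life_days: float) -> float:
    """Baseline-style match-likelihood weight: halves every half_life_days."""
    return 2.0 ** (-age_days / half_life_days)


def ema_snapshot_coefficients(snapshot_days: List[float], h_param_days: float) -> List[float]:
    """Coefficient of each snapshot's MLE in the blend at the last snapshot.

    Unrolling alpha^(k) = (1 - w_k) alpha^(k-1) + w_k alpha^(MLE,k) gives
    coefficient w_j * 2^(-(t_K - t_j) / h) for snapshot j, with the first
    snapshot carrying 2^(-(t_K - t_0) / h) because it seeds the recursion.
    """
    last = snapshot_days[-1]
    coeffs = []
    for k, t in enumerate(snapshot_days):
        carry = 2.0 ** (-(last - t) / h_param_days)
        if k == 0:
            coeffs.append(carry)
        else:
            w_k = 1.0 - 2.0 ** (-(t - snapshot_days[k - 1]) / h_param_days)
            coeffs.append(w_k * carry)
    return coeffs


# Eight snapshots at 90-day spacing, as in the gate setup.
days = [90.0 * k for k in range(8)]
coeffs_180 = ema_snapshot_coefficients(days, 180.0)
coeffs_2880 = ema_snapshot_coefficients(days, 2880.0)
```

The coefficients always sum to one. At `h_param = 2880` most of the mass stays on the first snapshot, whose MLE was fit with uniform match weights, which illustrates why the large-`h_param` end behaves like a long-window fit without the 5-year decay.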
Caveats
- Limited h_param grid. Five values spanning 6mo → 8y. A finer grid (or a golden-section search) might find a marginal improvement that the discrete sweep misses. However, the pattern across the grid (Brier worsens monotonically as h_param grows; calibrated ECE improves monotonically toward baseline, raw ECE less cleanly) leaves no obvious gap that's hiding a sweet spot: both extremes lose for OPPOSITE reasons (small-h: noisy parameters; large-h: lost temporal weighting from dropping match decay).
- `--fast` mode used for the gate run. The L-BFGS-B tolerances were loosened (gtol=1e-5, ftol=1e-7) to fit 16 MLE refits inside the 3-hour budget. A single-walk full-precision rerun (gtol=1e-7, ftol=1e-9) confirmed the per-fixture predictions differ by ~1bp Brier, well below the ~30bp gate margin. The verdict is robust to convergence tolerance.
- Tournament-only slice not evaluated separately. The design's evaluation plan suggested a separate K ≥ 50 tournament slice; it was skipped for time, since the gate already runs 16 MLE refits. Given the overall gate fails by ~31-67bp raw Brier (1-2 orders of magnitude above the "noise floor"), it's improbable that the tournament-only slice would flip the verdict; a follow-up could confirm.
- Sparse-team handling. The state-space variant retains the baseline MIN_MATCHES_PER_TEAM = 20 floor; teams that fall below the floor at snapshot k inherit their previous blended value unchanged. A higher floor (30 or 40) might quiet the per-snapshot noise but loses coverage on smaller federations — the design's §risks 2 flagged this as an unresolved tension.
- `home_advantage` and `ρ` held fixed. v0 design choice; letting them vary is a knob the design parked for v1.
Decision
Do not ship. The state-space EMA variant (a) fails the conjunction gate on every h_param value tested. Median raw Brier is ≥31bp worse than baseline across the entire grid (vs the design's "0.5bp distinguishable-from-noise floor"), and calibrated ECE is within the +0.2pp tolerance only for the two largest h_param values, which lose on Brier by even more and still miss the tolerance on raw ECE (+0.25pp, +0.50pp).
Following the design's stop-rule (questions §7): "If (a) EMA fails the gate — stop, or proceed to (b)? Prior: stop, by the rest-day precedent." This matches the GK-offset-confirm and composite-α precedents — when a single-knob ablation fails monotonically across its grid, the next variant typically inherits the same signal-vs-noise problem. The Kalman / Bayesian variants (b)/(c) would face the same per-team sparsity that's driving (a)'s noise; they offer better uncertainty quantification, not more signal.
The exploration is worth the documented negative result: the design hypothesis was plausible, the infrastructure for state-space DC is in place if a future data expansion (a third WC cycle, or a richer per-team covariate set) makes the hypothesis worth re-testing, and the scripts + tests are reusable for that re-test. No production code changes.