Letting team ratings drift over time (didn't improve predictions)

Status: Not shipped. See decision gate at the bottom.
Backtest date: 2026-05-27
Reproducer: scripts/backtest_state_space_dc.py --folds 8 --window-days 90 --h-grid 180 360 720 1440 2880 --today 2026-05-25 --fast
Persisted output: data/wc2026/state_space_dc_gate.json (gitignored; regenerable per host)
Design note: documentation/research-notes/state-space-dc-design.md

Hypothesis

Per the design note (variant a, "EMA on (α_t, β_t)"): each team's attack/defence parameters should EVOLVE through time rather than absorb every era's matches into a single stationary compromise. Refit DC at K snapshot timestamps (= the 8 quarterly walk cutoffs), blend each team's parameters across snapshots via an exponential moving average with tunable half-life h_param, and check whether the blended parameter trajectory beats the stationary baseline on walk-forward Brier / ECE.

Reasoning:

Argentina 2018 → 2022, Germany 2014 → 2018, Spain post-2018 etc. exhibit material within-team drift that the stationary fit averages across.
The shipping match-likelihood time-decay (5-year half-life on the baseline DC fit) attenuates old matches but doesn't tell the model WHEN a regime ended. It pools across regimes.
An EMA over per-snapshot MLE refits is the cheapest variant to test the hypothesis: ~50 lines of code on top of the existing fit, K refits per gate.

Implementation summary

scripts/fit_dixon_coles_state_space.py — refits DC at each snapshot timestamp on the FULL pre-cutoff corpus with uniform match weighting (match-likelihood half-life dropped per design §3.7: "let parameter evolution be the only decay mechanism"), then blends per-team parameters across snapshots via α_t^(k) = (1−w_k) · α_t^(k−1) + w_k · α_t^(MLE,k) with w_k = 1 − 2^(−Δt_k/h_param).
Sum-to-zero on (α, β) re-enforced per snapshot (design §3.1).
home_advantage and ρ held FIXED across snapshots in v0 (design §3.2), taken from the most recent snapshot's MLE.
Artefact format: data/wc2026/dixon_coles_state_space.json carries the full per-snapshot trajectory PLUS a legacy attack/defense/... block at the root pointing at the latest snapshot — downstream callers that only read those keys would work unchanged (design §3.6, option ii).
Gate runner: scripts/backtest_state_space_dc.py sweeps the h_param grid and writes the per-h_param table to data/wc2026/state_space_dc_gate.json. Per-snapshot and per-walk MLE fits are cached, so the full gate runs ~16 unique L-BFGS-B refits rather than O(walks × h_grid_size).
A --fast mode loosens the L-BFGS-B tolerances (gtol=1e-5, ftol=1e-7, maxfun=30k); the gate run that produced these numbers used --fast to fit inside a 3-hour wall-clock budget. A spot check (a synthetic ablation against a single full-precision fit, gtol=1e-7) showed that the relative ordering of state-space vs baseline does not flip and that individual Brier values differ by ~1bp.
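The blending rule in the summary can be sketched in a few lines of Python. This is a minimal illustration of the stated formula, not the contents of scripts/fit_dixon_coles_state_space.py: the function names, the dict-per-snapshot layout, initialising the trajectory at the first snapshot's MLE, and recentring after (rather than before) the blend are all assumptions.

```python
from datetime import date


def ema_weight(delta_days: float, half_life_days: float) -> float:
    """w_k = 1 - 2^(-dt_k / h_param): weight given to the new snapshot's MLE."""
    return 1.0 - 2.0 ** (-delta_days / half_life_days)


def recentre(params: dict[str, float]) -> dict[str, float]:
    """Re-enforce the sum-to-zero constraint across teams."""
    mean = sum(params.values()) / len(params)
    return {team: value - mean for team, value in params.items()}


def blend_snapshots(snapshots, half_life_days):
    """Blend one parameter (attack or defence) across snapshots.

    snapshots: list of (date, {team: mle_value}) sorted by date (hypothetical layout).
    Returns a list of (date, {team: blended_value}).
    """
    trajectory = []
    prev_date, prev = None, {}
    for snap_date, mle in snapshots:
        if prev_date is None:
            # Assumption: the trajectory starts at the first snapshot's MLE.
            blended = dict(mle)
        else:
            w = ema_weight((snap_date - prev_date).days, half_life_days)
            blended = {}
            for team in set(prev) | set(mle):
                if team not in mle:
                    # Team missing from this snapshot's fit: carry the old value.
                    blended[team] = prev[team]
                elif team not in prev:
                    # First appearance: nothing to blend with yet.
                    blended[team] = mle[team]
                else:
                    blended[team] = (1.0 - w) * prev[team] + w * mle[team]
        blended = recentre(blended)
        trajectory.append((snap_date, blended))
        prev_date, prev = snap_date, blended
    return trajectory
```

With a 90-day gap and h_param = 90d the weight is exactly 0.5, so a team whose snapshot MLE moves from 0.4 to 0.0 lands at 0.2. When every team appears in every snapshot and each MLE already sums to zero, the recentring step is a no-op; it only matters when the team set changes between snapshots.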

Backtest setup

Field	Value
Walks	8
Window per walk	90 days
Most recent walk	2026-02-24 → 2026-05-25
Earliest walk	2024-06-04 → 2024-09-02
Per-walk training	matches strictly before walk's `fit_until`
Per-snapshot fit	full pre-cutoff corpus with uniform match weights (`half_life_days = 1e9`), 10-year window, min 20 matches/team
h_param grid	`{180d, 360d, 720d, 1440d, 2880d}` (6mo, 1y, 2y, 4y, 8y)
Baseline	shipping stationary DC (`scripts/fit_dixon_coles.py`, 5y half-life on matches, same window + min-matches)
Acceptance gate	conjunction: median Brier strictly lower than baseline AND median ECE within +0.2pp, evaluated on BOTH raw AND isotonic-calibrated metrics

Result — gate fails on every h_param

Median across 8 walks (lower Brier and lower ECE are better):

Setting	raw Brier	raw ECE	cal Brier	cal ECE	ΔBrier(raw)	ΔBrier(cal)	ΔECE(raw)	ΔECE(cal)	Gate
Baseline DC	0.50376	6.42pp	0.50651	8.34pp	—	—	—	—	—
State-space 180d	0.50689	7.53pp	0.51191	9.77pp	+31.4bp	+54.0bp	+1.11pp	+1.42pp	fail
State-space 360d	0.50801	7.88pp	0.51387	9.62pp	+42.6bp	+73.6bp	+1.45pp	+1.27pp	fail
State-space 720d	0.50913	7.76pp	0.51599	9.14pp	+53.7bp	+94.7bp	+1.33pp	+0.80pp	fail
State-space 1440d	0.50994	6.67pp	0.51715	8.38pp	+61.8bp	+106.4bp	+0.25pp	+0.04pp	fail
State-space 2880d	0.51043	6.92pp	0.51782	8.27pp	+66.7bp	+113.1bp	+0.50pp	−0.07pp	fail

Every h_param degrades Brier on both raw and calibrated metrics. The smallest-h_param variants (180d, 360d), closest to "fully per-walk MLE", degrade Brier the LEAST (raw ΔBrier ≈ +31-43bp) but degrade calibrated ECE the MOST (+1.27-1.42pp). The largest-h_param variants (1440d, 2880d), closest to a long-window stationary fit, preserve calibrated ECE (within ±0.1pp of baseline) but degrade Brier even more (raw ΔBrier ≈ +62-67bp), and their raw ECE still misses the +0.2pp tolerance (+0.25pp and +0.50pp). Brier worsens monotonically as h_param grows and calibrated ECE improves monotonically (raw ECE is not monotone: it peaks at 360d and is lowest at 1440d), so the trade-off runs in one direction across the grid and no setting splits the difference.
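As a worked check, the acceptance gate from the setup table can be applied to the medians in the results table. The sketch below is illustrative only: the `Medians` container and function names are hypothetical, and treating "within +0.2pp" as an inclusive bound is an assumption. The numbers are copied from the table; deltas recomputed from these rounded medians can differ from the table's delta columns in the last digit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Medians:
    raw_brier: float
    raw_ece_pp: float
    cal_brier: float
    cal_ece_pp: float


ECE_TOLERANCE_PP = 0.2  # "within +0.2pp"; inclusive bound is an assumption


def delta_bp(candidate_brier: float, baseline_brier: float) -> float:
    """Brier difference in basis points (1bp = 1e-4)."""
    return (candidate_brier - baseline_brier) * 1e4


def passes_gate(candidate: Medians, baseline: Medians) -> bool:
    """Conjunction gate: strictly lower Brier AND ECE within tolerance,
    on both the raw and the isotonic-calibrated metrics."""
    brier_ok = (
        candidate.raw_brier < baseline.raw_brier
        and candidate.cal_brier < baseline.cal_brier
    )
    ece_ok = (
        candidate.raw_ece_pp - baseline.raw_ece_pp <= ECE_TOLERANCE_PP
        and candidate.cal_ece_pp - baseline.cal_ece_pp <= ECE_TOLERANCE_PP
    )
    return brier_ok and ece_ok


# Medians across the 8 walks, copied from the results table.
BASELINE = Medians(0.50376, 6.42, 0.50651, 8.34)
GRID = {
    180: Medians(0.50689, 7.53, 0.51191, 9.77),
    360: Medians(0.50801, 7.88, 0.51387, 9.62),
    720: Medians(0.50913, 7.76, 0.51599, 9.14),
    1440: Medians(0.50994, 6.67, 0.51715, 8.38),
    2880: Medians(0.51043, 6.92, 0.51782, 8.27),
}

for h_days, medians in sorted(GRID.items()):
    verdict = "pass" if passes_gate(medians, BASELINE) else "fail"
    raw_delta = delta_bp(medians.raw_brier, BASELINE.raw_brier)
    print(f"h_param={h_days}d  raw dBrier={raw_delta:+.1f}bp  gate={verdict}")
```

Because the gate is a conjunction, the Brier condition alone is enough to fail every row here; the ECE condition would only become decisive for a candidate that first beats the baseline on Brier.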

The per-walk picture is more textured (data/wc2026/state_space_dc_gate.json carries the full per-walk table). State-space matches or modestly beats baseline on Brier in walks 1 and 3 (the Brier delta is negative at 180d for walks 1, 3) but loses by 30-180bp on walks 2, 4-8. ECE is similarly mixed but the conjunction never aligns.

Why the negative result is plausible

The intl corpus is sparse enough that per-walk MLE adds noise faster than it adds signal. The design's risk §1 ("overfitting on per-team trends") and §2 ("sparse-team behaviour") are both real. With ~22 matches/team/walk on average — comfortable for major federations, far too few for the tail — the per-walk MLE wobbles team parameters in ways the EMA only partially absorbs. The rest-day-ablation precedent applies again: the DC parameters had already absorbed the effect through team identity.
Dropping the match-likelihood time-decay throws away signal that the EMA doesn't recover. The shipping baseline's 5-year half-life on matches IS a form of temporal weighting (matches from 5y ago count ~half as much as fresh). The state-space variant trades this match-level decay for a snapshot-level decay, but snapshots are coarse (90d granularity) and the team-level smoothing is per-snapshot, not per-match. The effective amount of "old data emphasis" the state-space variant gives is less smooth than the baseline's exponential decay, even at high h_param. The largest h_param values (1440d, 2880d) approach the long-window-MLE-with-uniform-weights endpoint — which loses to long-window-MLE-with-5y-decay on Brier directly.
Calibrator drag is real. The isotonic calibrator was fit against the stationary DC's ensemble outputs. The state-space variant produces a slightly different output distribution per fixture and the calibrator over-corrects toward the stationary curve, which is part of why calibrated metrics degrade MORE than raw metrics for the small-h_param variants. A re-fit calibrator could close part of this gap — but the raw-Brier degradation (+31bp at best) is far above the design's "0.5bp distinguishable-from-noise floor", so re-fitting the calibrator on a worse-Brier component is unlikely to flip the verdict.
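The contrast between match-level and snapshot-level decay in the second point can be made concrete. The sketch below compares the baseline's per-match weight 2^(−age/half-life) with the weight each snapshot refit implicitly receives in the final EMA blend. It assumes the trajectory is initialised at the first snapshot's MLE, equal 90-day gaps, and a 5-year half-life of 1825 days; the function names are hypothetical.

```python
def match_decay_weight(age_days: float, half_life_days: float) -> float:
    """Baseline-style weight on a single match of a given age."""
    return 2.0 ** (-age_days / half_life_days)


def ema_snapshot_weights(gaps_days, half_life_days):
    """Implied weight of each snapshot MLE in the final blended value.

    gaps_days[i] is the gap between snapshot i and snapshot i+1, so the
    result has len(gaps_days) + 1 entries, oldest snapshot first.
    """
    weights = [1.0]  # assumption: trajectory starts at the first MLE
    for gap in gaps_days:
        keep = 2.0 ** (-gap / half_life_days)  # this is 1 - w_k
        weights = [w * keep for w in weights]  # older snapshots shrink
        weights.append(1.0 - keep)             # new snapshot gets w_k
    return weights


# Eight quarterly snapshots means seven 90-day gaps.
for h_days in (180, 2880):
    weights = ema_snapshot_weights([90] * 7, h_days)
    print(h_days, [round(w, 3) for w in weights])

# Per-match weights under a 5-year half-life, for comparison.
print([round(match_decay_weight(age, 1825), 3) for age in (0, 365, 1825, 3650)])
```

The snapshot weights always sum to 1 but apply to whole refits, each of which already pools the full pre-cutoff corpus with uniform match weights. The blend therefore re-weights eight coarse estimates rather than individual matches, which is one way to read the "less smooth" remark above.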

Caveats

Limited h_param grid. Five values spanning 6mo → 8y. A finer grid (or a golden-section search) might find a marginal improvement that the discrete sweep misses. However, the pattern across the grid (Brier worsens monotonically as h_param grows; calibrated ECE improves monotonically toward baseline, while raw ECE is noisier but broadly lower at large h_param) leaves no obvious gap that could hide a sweet spot: the two extremes lose for OPPOSITE reasons (small-h: noisy parameters; large-h: lost temporal weighting from dropping match decay).
--fast mode used for the gate run. The L-BFGS-B tolerances were loosened (gtol=1e-5, ftol=1e-7) to fit 16 MLE refits inside the 3-hour budget. A single-walk full-precision rerun (gtol=1e-7, ftol=1e-9) confirmed the per-fixture predictions differ by ~1bp Brier — well below the ~30bp gate margin. The verdict is robust to convergence tolerance.
Tournament-only slice not evaluated separately. The design's evaluation plan suggested a separate K ≥ 50 tournament slice; it was skipped for time, since the gate already runs 16 MLE refits. Given that the overall gate fails by ~31-67bp raw Brier (roughly two orders of magnitude above the 0.5bp "noise floor"), it is improbable that the tournament-only slice would flip the verdict; a follow-up could confirm.
Sparse-team handling. The state-space variant retains the baseline MIN_MATCHES_PER_TEAM = 20 floor; teams that fall below the floor at snapshot k inherit their previous blended value unchanged. A higher floor (30 or 40) might quiet the per-snapshot noise but loses coverage on smaller federations; the design's risk §2 flagged this as an unresolved tension.
home_advantage and ρ held fixed. v0 design choice — letting them vary is a knob the design parked for v1.

Decision

Do not ship. The state-space EMA variant (a) fails the conjunction gate on every h_param value tested. Median raw Brier is ≥31bp worse than baseline across the entire grid (vs the design's "0.5bp distinguishable-from-noise floor"). ECE comes within the +0.2pp tolerance only on the calibrated metric, and only for the two largest h_param values, which lose on Brier by even more; raw ECE misses the tolerance at every setting.

Following the design's stop-rule (questions §7): "If (a) EMA fails the gate — stop, or proceed to (b)? Prior: stop, by the rest-day precedent." This matches the GK-offset-confirm and composite-α precedents — when a single-knob ablation fails monotonically across its grid, the next variant typically inherits the same signal-vs-noise problem. The Kalman / Bayesian variants (b)/(c) would face the same per-team sparsity that's driving (a)'s noise; they offer better uncertainty quantification, not more signal.

The exploration is worth the documented negative result: the design hypothesis was plausible, the infrastructure for state-space DC is in place if a future data expansion (a third WC cycle, or a richer per-team covariate set) makes the hypothesis worth re-testing, and the scripts + tests are reusable for that re-test. No production code changes.
