Research note

Letting team ratings drift over time (didn't improve predictions)

Status: Not shipped. See decision gate at the bottom.
Backtest date: 2026-05-27
Reproducer: scripts/backtest_state_space_dc.py --folds 8 --window-days 90 --h-grid 180 360 720 1440 2880 --today 2026-05-25 --fast
Persisted output: data/wc2026/state_space_dc_gate.json (gitignored; regenerable per host)
Design note: documentation/research-notes/state-space-dc-design.md

Hypothesis

Per the design note (variant a, "EMA on (α_t, β_t)"): each team's attack/defence parameters should EVOLVE through time rather than absorb every era's matches into a single stationary compromise. Refit DC at K snapshot timestamps (= the 8 quarterly walk cutoffs), blend each team's parameters across snapshots via an exponential moving average with tunable half-life h_param, and check whether the blended parameter trajectory beats the stationary baseline on walk-forward Brier / ECE.

Reasoning:

  • Argentina 2018 → 2022, Germany 2014 → 2018, Spain post-2018 etc. exhibit material within-team drift that the stationary fit averages across.
  • The shipping match-likelihood time-decay (5-year half-life on the baseline DC fit) attenuates old matches but doesn't tell the model WHEN a regime ended. It pools across regimes.
  • An EMA over per-snapshot MLE refits is the cheapest variant to test the hypothesis: ~50 lines of code on top of the existing fit, K refits per gate.

Implementation summary

  • scripts/fit_dixon_coles_state_space.py — refits DC at each snapshot timestamp on the FULL pre-cutoff corpus with uniform match weighting (match-likelihood half-life dropped per design §3.7: "let parameter evolution be the only decay mechanism"), then blends per-team parameters across snapshots via α_t^(k) = (1−w_k) · α_t^(k−1) + w_k · α_t^(MLE,k) with w_k = 1 − 2^(−Δt_k/h_param).
  • Sum-to-zero on (α, β) re-enforced per snapshot (design §3.1).
  • home_advantage and ρ held FIXED across snapshots in v0 (design §3.2), taken from the most recent snapshot's MLE.
  • Artefact format: data/wc2026/dixon_coles_state_space.json carries the full per-snapshot trajectory PLUS a legacy attack/defense/... block at the root pointing at the latest snapshot — downstream callers that only read those keys would work unchanged (design §3.6, option ii).
  • Gate runner: scripts/backtest_state_space_dc.py sweeps the h_param grid and writes the per-h_param table to data/wc2026/state_space_dc_gate.json. Per-snapshot and per-walk MLE fits are cached, so the full gate runs ~16 unique L-BFGS-B refits rather than O(walks × h_grid_size).
  • A --fast mode loosens the L-BFGS-B tolerances (gtol=1e-5, ftol=1e-7, maxfun=30k) — the gate run that produced these numbers used --fast to fit inside a 3-hour wall-clock budget. Spot-checked with a synthetic ablation against a single full-precision fit (gtol=1e-7): the relative ordering of state-space-vs-baseline doesn't flip, individual Brier differs by ~1bp.
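
The blending rule above can be sketched in a few lines. This is a minimal illustration, not the contents of scripts/fit_dixon_coles_state_space.py: the function names, the dict-per-snapshot layout, and the choice to seed the blend with the first snapshot's MLE are all assumptions made for the example.

```python
def ema_weight(dt_days, h_param_days):
    """Weight on the newest snapshot's MLE: w_k = 1 - 2^(-dt_k / h_param)."""
    return 1.0 - 2.0 ** (-dt_days / h_param_days)


def recentre(params):
    """Re-impose the sum-to-zero constraint across teams."""
    mean = sum(params.values()) / len(params)
    return {team: value - mean for team, value in params.items()}


def blend_snapshots(mle_by_snapshot, dt_days, h_param_days):
    """Blend one parameter family (attack or defence) across snapshots.

    mle_by_snapshot: list of {team: MLE value}, oldest snapshot first.
        Teams below the min-matches floor are simply absent (assumed layout).
    dt_days: gap in days between snapshot k-1 and k (dt_days[0] is unused).
    """
    blended = recentre(mle_by_snapshot[0])  # assumption: first snapshot seeds the EMA
    trajectory = [blended]
    for k in range(1, len(mle_by_snapshot)):
        w = ema_weight(dt_days[k], h_param_days)
        updated = dict(blended)  # teams absent at snapshot k carry their old value
        for team, mle in mle_by_snapshot[k].items():
            previous = blended.get(team, mle)  # new teams start at their MLE
            updated[team] = (1.0 - w) * previous + w * mle
        blended = recentre(updated)
        trajectory.append(blended)
    return trajectory


# Toy data, not real fits: two teams, two snapshots 180 days apart.
attack_mle = [{"A": 0.2, "B": -0.2}, {"A": 0.4, "B": -0.4}]
trajectory = blend_snapshots(attack_mle, dt_days=[0, 180], h_param_days=180)
print(trajectory[-1])
```

When the snapshot gap equals h_param the weight is exactly 0.5, so the toy blend lands halfway between the old value and the new MLE. Note that re-centring after each snapshot shifts carried-over teams slightly whenever the set of fitted teams changes; whether the real script handles that case the same way is not stated in this note.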

Backtest setup

| Field | Value |
|---|---|
| Walks | 8 |
| Window per walk | 90 days |
| Most recent walk | 2026-02-24 → 2026-05-25 |
| Earliest walk | 2024-06-04 → 2024-09-02 |
| Per-walk training | matches strictly before walk's fit_until |
| Per-snapshot fit | full pre-cutoff corpus with uniform match weights (half_life_days = 1e9), 10-year window, min 20 matches/team |
| h_param grid | {180d, 360d, 720d, 1440d, 2880d} (6mo, 1y, 2y, 4y, 8y) |
| Baseline | shipping stationary DC (scripts/fit_dixon_coles.py, 5y half-life on matches, same window + min-matches) |
| Acceptance gate | conjunction: median Brier strictly lower than baseline AND median ECE within +0.2pp, evaluated on BOTH raw AND isotonic-calibrated metrics |
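
The walk schedule can be reconstructed from the reproducer flags. The sketch below assumes the walks are contiguous, non-overlapping 90-day windows tiled backwards from --today; that assumption reproduces the earliest and most recent walks listed above, but walk_windows is a hypothetical helper, not a function from the gate runner.

```python
from datetime import date, timedelta


def walk_windows(today, folds, window_days):
    """Return (fit_until, window_end) pairs, oldest walk first."""
    windows = []
    for i in range(folds):
        end = today - timedelta(days=i * window_days)
        start = end - timedelta(days=window_days)
        windows.append((start, end))
    return list(reversed(windows))


windows = walk_windows(date(2026, 5, 25), folds=8, window_days=90)
for fit_until, window_end in windows:
    # Training uses matches strictly before fit_until; evaluation covers the window.
    print(fit_until.isoformat(), "->", window_end.isoformat())
```

The first line printed is 2024-06-04 -> 2024-09-02 and the last is 2026-02-24 -> 2026-05-25, matching the setup table.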

Result — gate fails on every h_param

Median across 8 walks (lower Brier and lower ECE are better):

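| Setting | raw Brier | raw ECE | cal Brier | cal ECE | ΔBrier (raw) | ΔBrier (cal) | ΔECE (raw) | ΔECE (cal) | Gate |
|---|---|---|---|---|---|---|---|---|---|
| Baseline DC | 0.50376 | 6.42pp | 0.50651 | 8.34pp | | | | | |
| State-space 180d | 0.50689 | 7.53pp | 0.51191 | 9.77pp | +31.4bp | +54.0bp | +1.11pp | +1.42pp | fail |
| State-space 360d | 0.50801 | 7.88pp | 0.51387 | 9.62pp | +42.6bp | +73.6bp | +1.45pp | +1.27pp | fail |
| State-space 720d | 0.50913 | 7.76pp | 0.51599 | 9.14pp | +53.7bp | +94.7bp | +1.33pp | +0.80pp | fail |
| State-space 1440d | 0.50994 | 6.67pp | 0.51715 | 8.38pp | +61.8bp | +106.4bp | +0.25pp | +0.04pp | fail |
| State-space 2880d | 0.51043 | 6.92pp | 0.51782 | 8.27pp | +66.7bp | +113.1bp | +0.50pp | −0.07pp | fail |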

Every h_param degrades Brier on both raw and calibrated metrics. The smallest-h_param variants (180d, 360d), closest to "fully per-walk MLE", degrade Brier the LEAST (raw ΔBrier +31.4bp and +42.6bp) but degrade calibrated ECE the MOST (+1.27pp to +1.42pp). The largest-h_param variants (1440d, 2880d), closest to a long-window stationary fit, preserve calibrated ECE (within ±0.1pp of baseline) but degrade Brier even more (raw ΔBrier +61.8bp and +66.7bp). Brier worsens monotonically as h_param grows while calibrated ECE improves monotonically (raw ECE is noisier: it peaks at 360d and ticks back up at 2880d); no setting splits the difference.
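
The gate verdicts can be re-derived from the published medians. The sketch below hard-codes the table values and reads "within +0.2pp" as ΔECE ≤ +0.2pp; that reading, and the helper names, are assumptions for illustration rather than the gate runner's actual code. Deltas recomputed from the rounded medians can differ from the table's delta columns in the last digit.

```python
BASELINE = {"raw_brier": 0.50376, "raw_ece": 6.42, "cal_brier": 0.50651, "cal_ece": 8.34}

# Median metrics per h_param (days); ECE values are in percentage points.
VARIANTS = {
    180: {"raw_brier": 0.50689, "raw_ece": 7.53, "cal_brier": 0.51191, "cal_ece": 9.77},
    360: {"raw_brier": 0.50801, "raw_ece": 7.88, "cal_brier": 0.51387, "cal_ece": 9.62},
    720: {"raw_brier": 0.50913, "raw_ece": 7.76, "cal_brier": 0.51599, "cal_ece": 9.14},
    1440: {"raw_brier": 0.50994, "raw_ece": 6.67, "cal_brier": 0.51715, "cal_ece": 8.38},
    2880: {"raw_brier": 0.51043, "raw_ece": 6.92, "cal_brier": 0.51782, "cal_ece": 8.27},
}

ECE_TOL_PP = 0.2


def brier_improves(variant, baseline, kind):
    """Gate leg 1: median Brier strictly lower than baseline."""
    return variant[f"{kind}_brier"] < baseline[f"{kind}_brier"]


def ece_within(variant, baseline, kind, tol_pp=ECE_TOL_PP):
    """Gate leg 2: median ECE no more than tol_pp above baseline (assumed reading)."""
    return variant[f"{kind}_ece"] - baseline[f"{kind}_ece"] <= tol_pp


def passes_gate(variant, baseline):
    """Conjunction over both legs, on both raw and calibrated metrics."""
    return all(
        brier_improves(variant, baseline, kind) and ece_within(variant, baseline, kind)
        for kind in ("raw", "cal")
    )


for h, variant in VARIANTS.items():
    d_brier_bp = (variant["raw_brier"] - BASELINE["raw_brier"]) * 1e4
    print(f"h={h}d raw dBrier={d_brier_bp:+.1f}bp gate={'pass' if passes_gate(variant, BASELINE) else 'fail'}")
```

Every setting fails on the Brier leg alone, so the ECE tolerance never becomes the deciding factor.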

The per-walk picture is more textured (data/wc2026/state_space_dc_gate.json carries the full per-walk table). State-space matches or modestly beats baseline on Brier in walks 1 and 3 (the Brier delta is negative at 180d for walks 1, 3) but loses by 30-180bp on walks 2, 4-8. ECE is similarly mixed but the conjunction never aligns.

Why the negative result is plausible

  1. The intl corpus is sparse enough that per-walk MLE adds noise faster than it adds signal. The design's risk §1 ("overfitting on per-team trends") and §2 ("sparse-team behaviour") are both real. With ~22 matches/team/walk on average — comfortable for major federations, far too few for the tail — the per-walk MLE wobbles team parameters in ways the EMA only partially absorbs. The rest-day-ablation precedent applies again: the DC parameters had already absorbed the effect through team identity.

  2. Dropping the match-likelihood time-decay throws away signal that the EMA doesn't recover. The shipping baseline's 5-year half-life on matches IS a form of temporal weighting (matches from 5y ago count ~half as much as fresh). The state-space variant trades this match-level decay for a snapshot-level decay, but snapshots are coarse (90d granularity) and the team-level smoothing is per-snapshot, not per-match. The effective amount of "old data emphasis" the state-space variant gives is less smooth than the baseline's exponential decay, even at high h_param. The largest h_param values (1440d, 2880d) approach the long-window-MLE-with-uniform-weights endpoint — which loses to long-window-MLE-with-5y-decay on Brier directly.

  3. Calibrator drag is real. The isotonic calibrator was fit against the stationary DC's ensemble outputs. The state-space variant produces a slightly different output distribution per fixture and the calibrator over-corrects toward the stationary curve, which is part of why calibrated metrics degrade MORE than raw metrics for the small-h_param variants. A re-fit calibrator could close part of this gap — but the raw-Brier degradation (+31bp at best) is far above the design's "0.5bp distinguishable-from-noise floor", so re-fitting the calibrator on a worse-Brier component is unlikely to flip the verdict.
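
The contrast in point 2 between match-level and snapshot-level decay can be made concrete with a little arithmetic. The sketch below assumes 8 snapshots spaced 90 days apart and an EMA seeded by the first snapshot; both are illustrative assumptions, and the helper names are hypothetical.

```python
def match_weight(age_days, half_life_days):
    """Baseline-style per-match weight: halves every half_life_days."""
    return 2.0 ** (-age_days / half_life_days)


def snapshot_weights(n_snapshots, dt_days, h_param_days):
    """Share of the final blend contributed by each snapshot's MLE, oldest first."""
    w = 1.0 - 2.0 ** (-dt_days / h_param_days)
    weights = [1.0]  # assumption: the first snapshot seeds the blend
    for _ in range(n_snapshots - 1):
        # Each update scales all earlier shares by (1 - w) and gives w to the newest fit.
        weights = [share * (1.0 - w) for share in weights] + [w]
    return weights


print("5y-old match weight:", match_weight(5 * 365, 5 * 365))
for h in (180, 360, 720, 1440, 2880):
    shares = snapshot_weights(8, 90, h)
    print(f"h={h}d oldest share={shares[0]:.3f} newest share={shares[-1]:.3f}")
```

Under these assumptions the oldest snapshot's share after seven updates is 2^(−630/h_param): small at h_param = 180d, but the dominant share at 2880d. In that regime the blend stays close to a single uniform-weight fit, the endpoint that point 2 says loses to the 5y-decay baseline.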

Caveats

  • Limited h_param grid. Five values spanning 6mo → 8y. A finer grid (or a golden-section search) might find a marginal improvement that the discrete sweep misses. But the monotone pattern across the grid (Brier worsens monotonically as h_param grows; calibrated ECE improves monotonically toward baseline) leaves no obvious gap that could hide a sweet spot: both extremes lose for OPPOSITE reasons (small-h: noisy parameters; large-h: lost temporal weighting from dropping match decay).
  • --fast mode used for the gate run. The L-BFGS-B tolerances were loosened (gtol=1e-5, ftol=1e-7) to fit 16 MLE refits inside the 3-hour budget. A single-walk full-precision rerun (gtol=1e-7, ftol=1e-9) confirmed the per-fixture predictions differ by ~1bp Brier — well below the ~30bp gate margin. The verdict is robust to convergence tolerance.
  • Tournament-only slice not evaluated separately. The design's evaluation plan suggested a separate K ≥ 50 tournament slice; it was skipped for time, since the gate already runs 16 MLE refits. Given that the overall gate fails by ~31-67bp raw Brier (roughly two orders of magnitude above the 0.5bp "noise floor"), it's improbable that the tournament-only slice would flip the verdict; a follow-up could confirm.
  • Sparse-team handling. The state-space variant retains the baseline MIN_MATCHES_PER_TEAM = 20 floor; teams that fall below the floor at snapshot k inherit their previous blended value unchanged. A higher floor (30 or 40) might quiet the per-snapshot noise but loses coverage on smaller federations — the design's §risks 2 flagged this as an unresolved tension.
  • home_advantage and ρ held fixed. v0 design choice — letting them vary is a knob the design parked for v1.

Decision

Do not ship. The state-space EMA variant (a) fails the conjunction gate on every h_param value tested. Median Brier is ≥31bp worse than baseline across the entire grid (vs the design's "0.5bp distinguishable-from-noise floor"). Calibrated ECE is within the +0.2pp tolerance only for the two largest h_param values, which lose on Brier by even more; raw ECE exceeds the tolerance at every setting (+0.25pp at best).

Following the design's stop-rule (questions §7): "If (a) EMA fails the gate — stop, or proceed to (b)? Prior: stop, by the rest-day precedent." This matches the GK-offset-confirm and composite-α precedents — when a single-knob ablation fails monotonically across its grid, the next variant typically inherits the same signal-vs-noise problem. The Kalman / Bayesian variants (b)/(c) would face the same per-team sparsity that's driving (a)'s noise; they offer better uncertainty quantification, not more signal.

The exploration is worth the documented negative result: the design hypothesis was plausible, the infrastructure for state-space DC is in place if a future data expansion (a third WC cycle, or a richer per-team covariate set) makes the hypothesis worth re-testing, and the scripts + tests are reusable for that re-test. No production code changes.