Status: Design only. No code written, no fit run, no decision taken.
Author date: 2026-05-26
Companion code (current): scripts/fit_dixon_coles.py, scripts/ensemble.py, scripts/backtest_composite_offset.py (walk-forward harness template).
Hypothesis
The shipping Dixon-Coles fit (scripts/fit_dixon_coles.py) gives every
team one attack parameter α_t and one defense parameter
β_t for the entire fitting window. Time-decay is applied on the
match-likelihood weights (5-year half-life on each match's
contribution), but the parameters themselves are stationary — a fit
that absorbs both the 2018 Argentina and the 2022 Argentina produces a
single compromise. The question this note scopes: would per-team
parameters that vary through time — α_t(t), β_t(t) — buy
Brier / log-loss / ECE lift on the 8×90d walk-forward gate, and what
would it cost to find out.
Three corpus observations that motivate the question:
- Argentina 2018 → 2022. Lost in the 2018 round-of-16 to France on a 4-3 high-variance match; won the 2022 World Cup. The stationary fit averages across both regimes, dampening the late-window strength signal that an autumn-2022 prediction should reflect.
- Germany 2014 → 2018. World champion to group-stage exit in four years. Under the 5-year half-life a 2014 match still carries about 57% of the weight of a 2018 match (2^(−4/5) ≈ 0.57), so both eras contribute comparably at the 2018 cutoff and the stationary fit produces a Germany rating well above the post-2018 truth.
- Mid-cycle regime shifts. Managerial change (Spain post-2018, Italy post-Euro-2020), generational turnover (Croatia post-2018, Belgium post-2022), and political/structural disruption (Russia excluded post-2022, Ukraine playing in exile) all produce step changes that a 5-year-window MLE doesn't model. The time-decay weighting attenuates old observations but doesn't tell the model when a regime ended; the fit pools across regimes.
The conjecture is that per-fixture prediction error has a
non-trivial component coming from this within-team drift, and that a
state-space model — one that lets α_t(t) and β_t(t) walk through
time — can pull a few basis points of Brier improvement on the gate
without breaking the ensemble's calibration story.
The conjecture is plausible but not strong. The rest-day ablation
(documentation/research-notes/rest-day-ablation.md) found a
statistically very precise rest-day coefficient (that note's α, not
the attack parameter here) with no Brier lift, because the DC
parameters had already absorbed the effect through team identity.
Time-varying parameters face a related risk: the corpus is sparse
enough per team that a flexible state-space model may just be
re-absorbing the same signal in a different parameterisation.
Approach options
Three candidate variants, ordered roughly by implementation cost.
| Variant | Parameterisation | Inference | Hyperparameters | Per-team minimum | Comp. cost (relative to baseline DC) |
|---|---|---|---|---|---|
| (a) EMA on (α_t, β_t) | Block-by-block refit; each fit's parameters are an EMA of the prior fit's with tunable half-life h_param. | Sequential MLE — re-fit DC at each of K timestamps, blend with prior fit. | One: h_param. (Match-likelihood time-decay h_match stays.) | 20 matches/team total — same as baseline. | ~K× baseline (K refits). For K=8 walks, ~8× — modest. |
| (b) Kalman / state-space | Latent state θ_t = (α_t, β_t) per team, AR(1) or random-walk transition θ_{t+1} = θ_t + ε_t, ε ~ N(0, Q). Match emits Poisson goals conditional on current θ. | Extended Kalman filter (non-Gaussian emissions) or Laplace approx; smoothing pass for retrospective θ_t. | Two per team OR shared: Q (process noise) for α, β. Plus initial-state prior. | Each team's θ is a time series — needs ≥ ~10 matches across the window AND ≥ ~5 matches in the recent regime to be useful. | 5-20× baseline depending on smoother. Worse-than-linear in T × N_teams. |
| (c) Random-walk Bayesian | Joint hierarchical model: α_t(t_k) − α_t(t_{k-1}) ~ N(0, σ_α²) with σ shared across teams or partial-pooled. | NUTS / HMC sampler (Stan, PyMC, or NumPyro). Posterior over the full α_t(·) trajectory per team. | σ_α, σ_β (random-walk scales). Posteriors carry calibrated uncertainty. | Same as (b) — needs trajectory coverage per team. | 50-500× baseline. NUTS on this many parameters (~200 teams × ~T timestamps × 2) is heavy. |
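The predict/update cycle in row (b) can be illustrated for a single scalar state. This is a minimal sketch, not the proposed implementation: it assumes one latent log-rate θ with goals ~ Poisson(exp(θ)), and it ignores the two-team structure, home advantage, and the ρ correction of the real DC likelihood. The function names are hypothetical.

```python
import math


def predict(mean, var, process_var):
    """Random-walk transition: the mean is unchanged, the variance grows by Q."""
    return mean, var + process_var


def poisson_update(mean, var, goals, n_iter=25):
    """Laplace update for goals ~ Poisson(exp(theta)) with prior theta ~ N(mean, var).

    Newton iterations locate the posterior mode; the inverse of the negative
    Hessian at the mode is used as the approximate posterior variance.
    """
    theta = mean
    for _ in range(n_iter):
        rate = math.exp(theta)
        grad = goals - rate - (theta - mean) / var  # d/dtheta of the log posterior
        hess = -rate - 1.0 / var                    # always negative (concave)
        theta -= grad / hess
    post_var = 1.0 / (math.exp(theta) + 1.0 / var)
    return theta, post_var


# One step: diffuse the state, then condition on an observed goal count.
prior_mean, prior_var = predict(0.0, 0.5, process_var=0.1)
post_mean, post_var = poisson_update(prior_mean, prior_var, goals=4)
```

A larger process variance Q lets the state move faster between matches, which is the role h_param plays in variant (a).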
A few notes on the comparison:
- (a) EMA is the operational upgrade. Re-run DC at K timestamps, blend per-team via α_t^{(k)} = (1−w) α_t^{(k−1)} + w α_t^{MLE,k} with w = 1 − 2^(−Δt/h_param). ~50 lines on top of the existing fit. Failure mode: a noisy per-walk MLE just gets smoothed.
- (b) Kalman is the principled middle ground: proper Poisson emission noise, closed-form θ_t posterior under linear-Gaussian approximation, backward smoothing. The Poisson emission needs an EKF or Laplace approximation per step; the DC ρ correction adds further non-linearity.
- (c) Random-walk Bayesian is the full statistical solution. Posterior trajectories per team, partial pooling of the random-walk variance (so sparse-corpus teams inherit a sensible variance from the global mean), and direct posterior predictive sampling. NUTS on 49k matches × ~270 teams is hours per chain. Publishable methodological contribution; also the variant most likely to overfit the corpus.
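The EMA blend in (a) is small enough to sketch directly. The following is a minimal illustration of the formula above, assuming per-team parameters are held in plain dicts keyed by team; the function names and the handling of teams with no prior fit are assumptions, not part of the existing scripts.

```python
import math


def ema_weight(dt_days, half_life_days):
    """w = 1 - 2^(-dt / h_param); an infinite half-life gives w = 0."""
    if math.isinf(half_life_days):
        return 0.0
    return 1.0 - 2.0 ** (-dt_days / half_life_days)


def ema_blend(prev_params, mle_params, dt_days, half_life_days):
    """Blend the previous walk's parameters with the current walk's MLE."""
    w = ema_weight(dt_days, half_life_days)
    blended = {}
    for team, mle_value in mle_params.items():
        if team in prev_params:
            blended[team] = (1.0 - w) * prev_params[team] + w * mle_value
        else:
            # Assumption: a team with no prior fit starts at its current MLE.
            blended[team] = mle_value
    return blended


# With dt equal to the half-life, old and new values are weighted equally.
attack = ema_blend({"AAA": 0.2}, {"AAA": 0.6, "BBB": 0.5}, dt_days=90, half_life_days=90)
```

The same function would be applied separately to the attack and defense dicts at each of the K cutoffs.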
Identifiability concerns (cross-cutting):
- The sum-to-zero (α, β) constraint is currently per-fit. A time-varying fit must decide whether to enforce it per-timestamp (cross-team scale invariant through time) or only at the first timestamp (a uniform shift in league-wide scoring can be absorbed by a global α_t drift). Prior: per-timestamp, which is cleaner.
- home_advantage and ρ can in principle vary too. v0 should hold them fixed; making them vary risks calibration drift.
- The match-likelihood time-decay (HALF_LIFE_DAYS = 5 * 365) competes with the parameter-evolution timescale. Setting both produces a double-decay that's hard to reason about. v0 should drop the match-likelihood decay in the state-space variant and let the parameter evolution be the only time-decay mechanism.
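To make the double-decay concern concrete, here is a rough sketch of how two half-lives would compound. It assumes, as a simplification, that both mechanisms act as multiplicative exponential down-weights on the same match; the EMA on refitted parameters is not exactly that, so treat the combined figure as indicative only. The helper names are hypothetical.

```python
HALF_LIFE_DAYS = 5 * 365  # match-likelihood half-life quoted in this note


def decay_weight(age_days, half_life_days):
    """Exponential down-weight of an observation that is age_days old."""
    return 2.0 ** (-age_days / half_life_days)


def combined_half_life(h_match_days, h_param_days):
    """Half-life of the product of two exponential decays (decay rates add)."""
    return 1.0 / (1.0 / h_match_days + 1.0 / h_param_days)


# Example: the 5-year match decay stacked on a 2-year parameter half-life.
effective = combined_half_life(HALF_LIFE_DAYS, 2 * 365)
```

Under this simplification the effective half-life is always shorter than either input, which is why a grid over h_param alone would be hard to interpret with both decays active.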
Recommendation
Start with (a) EMA. Promote to (b) Kalman only if (a) clears the gate and there's appetite for a methodological writeup.
Reasoning:
- The cheapest path to a yes/no on the underlying hypothesis is the simplest model. If the corpus contains exploitable within-team drift, an EMA over per-walk MLE refits will pick up part of it. If the EMA doesn't move Brier, neither will the more sophisticated variants — they all chase the same signal.
- The walk-forward harness already produces per-walk DC refits (backtest_models.fit_models_pre_cutoff). The EMA variant reuses this infrastructure: K refits at K cutoffs, then a blend. The code surface is small.
- The Kalman / random-walk variants only become worth the cost if either (i) the EMA shows lift and you want to extract more, or (ii) the project decides to publish a methodological note and needs principled uncertainty quantification.
- There is no compliance risk to any of the variants: all are pure probability publication. The recommendation is purely a cost/value call.
The decision lever for the user is whether the upgrade is a one-shot ablation or a publishable methodological contribution. The first points to (a); the second points to (b) or (c). Both are defensible.
Data inventory
The corpus is documented in documentation/methodology.md. For this
upgrade:
- data/raw/intl/results.csv: 49,329 matches (martj42 mirror), date / home / away / scores / tournament / neutral flag. This is the binding constraint: the international corpus is small relative to club-football state-space studies (which routinely have 100k+ matches per league across decades).
- data/raw/intl/xg.csv: partial xG coverage on intl matches, consumable via --use-xg in the existing fit. Coverage is well below 50% even on recent matches; a state-space fit on xG-only would be too sparse. The state-space variant should default to goals as the response (matching baseline DC).
- data/wc2026/intl_elo_history.csv: rolled-forward Elo per team per match date. Not used by DC directly but available as a state-space initialisation prior if needed (centre each team's α_t(t_0) on its Elo at t_0).
- Per-team match counts in the 5-year window: ~50-80 matches for major federations (BRA, ARG, ESP, FRA, GER, ENG, …), 15-40 for mid-tier (CMR, NGA, USA, KOR, JPN, …), 5-15 for small federations. The MIN_MATCHES_PER_TEAM = 20 floor already trims the tail; the state-space variant likely needs to raise this to 30 or 40, since fitting a trajectory needs more data than fitting a single point estimate.
What's missing: the corpus does not include continuous match-level covariates that would help disentangle drift from noise (e.g. roster-quality time series, ELO-of-opposition time series). These could be added as fixed-effect controls in a Bayesian variant but aren't in the pipeline today.
Bottleneck: for variants (b) and (c), the binding constraint is per-team trajectory coverage, not total corpus size. A team like São Tomé and Príncipe has 8 matches in the 5-year window; there is not enough signal to fit a 2D trajectory. The state-space fit needs to either (i) raise the min-matches floor (losing ~50 teams that the baseline currently fits), or (ii) partial-pool the random-walk variance so sparse teams effectively inherit a near-stationary trajectory. The Bayesian variant (c) gets (ii) for free; the Kalman variant (b) needs explicit handling.
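A quick way to see which teams a higher floor would drop is to count matches per team inside the window. The sketch below uses only the standard library and made-up rows; the column names home_team and away_team are an assumption about the results.csv header, and both helper functions are hypothetical.

```python
import csv
import io
from collections import Counter
from datetime import date, timedelta


def matches_per_team(rows, cutoff, window_days=5 * 365):
    """Count matches per team with start <= match date < cutoff."""
    start = cutoff - timedelta(days=window_days)
    counts = Counter()
    for row in rows:
        played = date.fromisoformat(row["date"])
        if start <= played < cutoff:
            counts[row["home_team"]] += 1
            counts[row["away_team"]] += 1
    return counts


def teams_below_floor(counts, floor):
    """Teams that appear in the window but fall short of the floor."""
    return sorted(team for team, n in counts.items() if n < floor)


# Synthetic rows for illustration only; not real fixtures.
SAMPLE = """date,home_team,away_team
2022-12-18,Team A,Team B
2023-03-23,Team A,Team C
2015-06-07,Team B,Team C
"""
rows = list(csv.DictReader(io.StringIO(SAMPLE)))
counts = matches_per_team(rows, cutoff=date(2024, 1, 1))
dropped = teams_below_floor(counts, floor=2)
```

Running this against the real corpus at floors of 20, 30, and 40 would show how much coverage each candidate floor costs; teams with no matches in the window never enter the counter and need separate handling.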
Evaluation plan
Mirror the conjunction-gate format used by
scripts/calibrate_gk_offset.py and
scripts/backtest_composite_offset.py, since the methodology page and
the audit doc both reference it.
| Field | Value |
|---|---|
| Walks | 8 |
| Window per walk | 90 days |
| Most recent walk | 2026-02-24 → 2026-05-25 |
| Earliest walk | 2024-06-04 → 2024-09-02 |
| Per-walk fit | EMA-DC + HP refit on pre-cutoff matches (HP held stationary as baseline; only DC swaps) |
| Tunable | h_param half-life on parameter evolution, grid {∞ (baseline), 4y, 2y, 1y, 6mo, 3mo} |
| Comparison surface | DC component standalone (raw Brier / log-loss / ECE) AND DC's contribution to the uniform ensemble |
| Acceptance gate | conjunction: median Brier strictly lower than h_param = ∞ baseline AND median ECE within +0.2pp |
| Monotone check | as in gk-offset-8walk-confirm.md — improvement should be monotone or at least non-reversing across the grid; a one-point peak with degraded neighbours is a red flag for overfit |
| Tournament-only ECE | reported separately on the K ≥ 50 slice (per tournament-only-backtest.md) — the WC predict path is the one that matters most |
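The walk boundaries in the table follow from stepping back 90 days at a time from the most recent end date. A minimal sketch, assuming contiguous windows that share their boundary date (whether the harness treats a boundary day as belonging to one walk or the other is not specified here); walk_windows is a hypothetical helper.

```python
from datetime import date, timedelta


def walk_windows(last_end, n_walks=8, window_days=90):
    """Return (start, end) pairs, oldest first, stepping back from last_end."""
    windows = []
    end = last_end
    for _ in range(n_walks):
        start = end - timedelta(days=window_days)
        windows.append((start, end))
        end = start  # the next (older) walk ends where this one starts
    return windows[::-1]


windows = walk_windows(date(2026, 5, 25))
for start, end in windows:
    print(start.isoformat(), "->", end.isoformat())
```

With the defaults this reproduces the earliest (2024-06-04 → 2024-09-02) and most recent (2026-02-24 → 2026-05-25) walks listed above.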
Per-walk reporting follows the shipped notes' table format:
| h_param | DC median Brier | DC ensemble median Brier | Δ vs baseline (median) | ECE delta |
|---:|---:|---:|---:|---:|
| ∞ (baseline) | … | … | — | — |
| 4y | … | … | … | … |
| 2y | … | … | … | … |
| 1y | … | … | … | … |
| 6mo | … | … | … | … |
| 3mo | … | … | … | … |
Shipping thresholds (mirrored from the GK-offset note):
- A Brier improvement of ≥ 0.5 bp (median, across 8 walks) is the rough floor of what's distinguishable from noise on a corpus of this size. Below that, the rest-day ablation precedent says "don't ship".
- ECE no worse than +0.2pp versus baseline is the calibration half of the conjunction. The state-space variant could plausibly help here too (better-calibrated late-window probabilities), but the gate doesn't reward ECE improvement; it only forbids ECE regression.
- Tournament-only ECE within ±0.5pp (looser, since n is smaller and the metric is noisier on 100-200 tournament matches per walk).
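The conjunction gate and the shipping floor can be written as a small check over per-walk metrics. This is a sketch under stated assumptions: Brier and ECE are on a 0-1 scale (so 0.5 bp = 0.00005 and 0.2pp = 0.002), and the comparison is a difference of medians rather than a median of per-walk differences. The note does not pin down either choice, and the function name is hypothetical.

```python
from statistics import median

BRIER_FLOOR = 0.5e-4    # 0.5 bp, assuming Brier on a 0-1 scale
ECE_TOLERANCE = 0.2e-2  # 0.2 pp, assuming ECE on a 0-1 scale


def evaluate_gate(brier_base, brier_cand, ece_base, ece_cand):
    """Compare per-walk metrics for one h_param value against the baseline."""
    brier_gain = median(brier_base) - median(brier_cand)   # positive = better
    ece_regression = median(ece_cand) - median(ece_base)   # positive = worse
    passes = brier_gain > 0 and ece_regression <= ECE_TOLERANCE
    return {
        "brier_gain": brier_gain,
        "ece_regression": ece_regression,
        "passes_gate": passes,
        "clears_floor": passes and brier_gain >= BRIER_FLOOR,
    }


# Illustrative numbers only, not backtest results.
result = evaluate_gate(
    brier_base=[0.2000] * 8,
    brier_cand=[0.1998] * 8,
    ece_base=[0.010] * 8,
    ece_cand=[0.011] * 8,
)
```

Keeping the gate and the noise floor as separate flags makes it visible when a candidate is nominally better but within noise.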
Negative result format: if the gate fails on every h_param, the
note ships as state-space-dc-ablation.md with the same skeleton as
rest-day-ablation.md — clear failure-mode discussion, what the
backtest can and cannot tell you, no code merged to production.
Scope estimate
Honest range for the (a) EMA variant, assuming the recommended path:
| Phase | Days | Notes |
|---|---|---|
| Data prep / harness wiring | 1 | Mostly reuses backtest_models.fit_models_pre_cutoff. The per-walk refit loop already exists. |
| EMA blending logic + min-matches re-floor | 1 | The blending step is ~50 lines. Re-tuning MIN_MATCHES_PER_TEAM and verifying no major federations drop out is the more careful part. |
| Walk-forward gate run (full grid × 8 walks) | 1-2 | Each walk's DC refit takes ~2-3 minutes; 6 h_param values × 8 walks ≈ 2 hours of CPU per run. Expect 2-3 runs (initial, after iterating on bugs, final). |
| Reliability diagram + tournament-only slice | 0.5 | Pulls from scripts/backtest_models.py --tournaments-only. |
| Writeup | 1-1.5 | The note itself (≈ 1500-3000 words, mirror format), plus the web/public/research/notes/ mirror copy, plus methodology-page row if it ships. |
| Total for (a) | 4.5-6 days | One-week scope if everything goes well, ten days if there's an inconvenient bug. |
For (b) Kalman, multiply by ~3-4× — the smoother + EKF implementation is a several-day project on its own, the diagnostics are more demanding, and the writeup needs more methodological care (prior choice, identifiability discussion, computational reproducer). ~3 weeks.
For (c) Random-walk Bayesian, multiply by ~6-8× — building the PyMC / NumPyro model, validating chain convergence, running posterior predictive checks on the walk-forward gate. ~4-6 weeks, and that's assuming Stan-equivalent tooling is in working order on the dev host.
Risks
- Overfitting on per-team trends. A 6-month h_param over an 8-walk gate effectively fits 8 separate DC models per team. With ~270 teams and 49k matches in the window, that's ~22 matches per team per walk on average: fine for the major federations, far too few for the tail. The EMA partially mitigates this (smoothing across walks), but the Bayesian variant's partial pooling is the only formal protection.
- Sparse-team behaviour. Federations with ≤ 30 matches in 5 years have effectively no trajectory to fit. The state-space variant needs an explicit fallback path: either re-raise MIN_MATCHES_PER_TEAM (losing coverage) or fall back to a stationary fit per team (special-cased in the predict path). The current MIN_MATCHES_PER_TEAM of 20 is already tight for a 2-parameter-per-team model; a 2-parameter-trajectory model needs more.
- Computational cost compounds with grid size. The proposed h_param grid is 6 values × 8 walks = 48 DC refits. Each refit is 2-3 minutes on the current host; the total is 1.5-2.5 CPU-hours per run. (a) EMA stays within this envelope; (b) Kalman pushes it to 8-12 hours per run; (c) Bayesian to 50-100+ hours. The walk-forward gate is the bottleneck for all three.
- Ensemble interaction. The current ensemble (uniform mean of Elo, DC, HP; per-class isotonic calibrator on the result) was tuned with stationary DC. Swapping DC for a state-space variant may shift the DC component's contribution enough that the calibrator's per-tier curves need a re-fit, and the per-class isotonic from fit_ensemble_calibrator.py is the surface that matters for the published probabilities. Plan: gate the swap on both raw-ensemble Brier AND calibrated-ensemble Brier; if the two diverge, hold the ship until the calibrator is re-fit.
- Identifiability drift across timestamps. A per-timestamp sum-to-zero constraint on (α, β) is the clean choice, but it means a uniform shift in the league's offensive level (e.g. a 2020-era post-COVID scoring dip) gets attributed to noise rather than absorbed by a global α_t shift. Worth pre-registering the choice and noting in the writeup.
- Backwards compatibility with downstream callers. fit_dixon_coles.json is read by ensemble.py, backtest_composite_offset.py, predict_match, and the dashboard. A state-space fit produces a time-indexed α_t(t) and β_t(t), not a single value per team. The artefact format needs to either (i) carry the full trajectory and downstream callers learn to pull the most-recent timestep, or (ii) snapshot the current-time values into the legacy slots and carry the trajectory as a separate block. Option (ii) is cleaner for the v0 ship and preserves the "downstream callers that only read attack/defense keep working unchanged" guarantee that the existing code documents. Worth doing.
- Time-decay double-counting. As noted under the identifiability concerns in Approach options, the existing HALF_LIFE_DAYS = 5 * 365 competes with the state-space evolution timescale. The v0 form should drop the match-likelihood time-decay in the state-space variant; if it doesn't, the effective half-life is at most the shorter of the two and the gate's h_param grid is misleading. Pre-register the choice in the design note before running the backtest.
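Option (ii) from the backwards-compatibility risk could look roughly like the following. This is a sketch only: it assumes the legacy attack and defense slots are team-keyed maps of single values, and the trajectory block with its field names is a hypothetical layout, not the current artefact schema.

```python
import json


def build_artefact(timestamps, trajectory):
    """Snapshot the latest values into legacy slots; keep the path as a sidecar.

    trajectory maps team -> {"attack": [...], "defense": [...]}, with each
    list aligned to timestamps (oldest first).
    """
    for team, series in trajectory.items():
        for key in ("attack", "defense"):
            if len(series[key]) != len(timestamps):
                raise ValueError(f"{team}: {key} length does not match timestamps")
    return {
        # Legacy slots: one value per team, as existing callers expect.
        "attack": {team: s["attack"][-1] for team, s in trajectory.items()},
        "defense": {team: s["defense"][-1] for team, s in trajectory.items()},
        # Hypothetical sidecar block; readers that only use the legacy
        # slots can ignore it.
        "trajectory": {"timestamps": timestamps, "teams": trajectory},
    }


# Illustrative values only, not fitted parameters.
example = build_artefact(
    ["2025-11-26", "2026-02-24", "2026-05-25"],
    {"AAA": {"attack": [0.10, 0.15, 0.22], "defense": [-0.05, -0.02, 0.01]}},
)
round_tripped = json.loads(json.dumps(example))
```

Because the legacy keys keep their shape, a caller that reads only attack and defense sees the most recent timestep without any code change.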
Open questions for the user
1. One-shot ablation or methodological contribution? The primary lever. "Does the hypothesis pay off on Brier?" → (a) EMA, 4-6 days. "Publishable state-space note alongside Dixon-Coles 1997 / Baio-Blangiardo 2010?" → (c) Bayesian, multi-week.
2. h_param grid: discrete or continuous? Shipped notes use small discrete grids with a monotone-across-grid check baked into the gate (gk-offset-8walk-confirm.md). A continuous search (golden-section) is more precise but loses the diagnostic. Prior: discrete.
3. Floor on per-team match count? Current MIN_MATCHES is 20 for the stationary fit. Trajectory fitting needs more: 30? 40? Higher is safer but loses coverage on smaller federations.
4. Should home_advantage and ρ also be time-varying? Prior: no, hold them fixed in v0. But the literature notes reduced home-effect in post-COVID crowd-less matches, so there's a real story that says they should drift too. Worth surfacing.
5. Drop the match-likelihood time-decay in the state-space variant? Prior: yes, let parameter evolution be the only time-decay mechanism. Alternative: keep both as a stacked smoother. Pre-register before running the gate.
6. Artefact format: trajectory in legacy slots, or sidecar? Prior: sidecar (preserves the backwards-compat guarantee that downstream readers of attack/defense keep working). User may prefer the cleaner long-term option: bake trajectory into dixon_coles.json and bump the format version.
7. If (a) EMA fails the gate: stop, or proceed to (b)? Prior: stop, by the rest-day precedent. But there are reasons (b) could succeed where (a) doesn't (proper uncertainty, partial pooling). Pre-register a stop rule so the project doesn't chase the upgrade indefinitely.
This note is design only. No code has been written, no fit has been run, no decision has been taken. The next action is the user's call on questions 1, 2, and 4-7 above.