Research note

Can team strength change mid-season? Design for a time-varying model

Status: Design only. No code written, no fit run, no decision taken.
Author date: 2026-05-26
Companion code (current): scripts/fit_dixon_coles.py, scripts/ensemble.py, scripts/backtest_composite_offset.py (walk-forward harness template).

Hypothesis

The shipping Dixon-Coles fit (scripts/fit_dixon_coles.py) gives every team one attack parameter α_t and one defense parameter β_t for the entire fitting window. Time-decay is applied on the match-likelihood weights (5-year half-life on each match's contribution), but the parameters themselves are stationary: a fit that absorbs both the 2018 Argentina and the 2022 Argentina produces a single compromise. The question this note scopes: would per-team parameters that vary through time, α_t(t) and β_t(t), buy Brier / log-loss / ECE lift on the 8×90d walk-forward gate, and what would it cost to find out?

Three corpus observations that motivate the question:

  • Argentina 2018 → 2022. Lost in the 2018 round-of-16 to France on a 4-3 high-variance match; won the 2022 World Cup. The stationary fit averages across both regimes, dampening the late-window strength signal that an autumn-2022 prediction should reflect.
  • Germany 2014 → 2018. World champion to group-stage exit in four years. The stationary fit, weighted by 5-year half-life, has roughly equal contributions from both eras at the 2018 cutoff and produces a Germany rating well above the post-2018 truth.
  • Mid-cycle regime shifts. Managerial change (Spain post-2018, Italy post-Euro-2020), generational turnover (Croatia post-2018, Belgium post-2022), and political/structural disruption (Russia excluded post-2022, Ukraine playing in exile) all produce step changes that a 5-year-window MLE doesn't model. The time-decay weighting attenuates old observations but doesn't tell the model when a regime ended; the fit pools across regimes.

The conjecture is that per-fixture prediction error has a non-trivial component coming from this within-team drift, and that a state-space model — one that lets α_t(t) and β_t(t) walk through time — can pull a few basis points of Brier improvement on the gate without breaking the ensemble's calibration story.

The conjecture is plausible but not strong. The rest-day ablation (documentation/research-notes/rest-day-ablation.md) found a statistically very precise α with no Brier lift, because the DC parameters had already absorbed the effect through team identity. Time-varying parameters face a related risk: the corpus is sparse enough per-team that a flexible state-space model may just be re-absorbing the same signal in a different parameterisation.

Approach options

Three candidate variants, ordered roughly by implementation cost.

| Variant | Parameterisation | Inference | Hyperparameters | Per-team minimum | Compute cost (relative to baseline DC) |
|---|---|---|---|---|---|
| (a) EMA on (α_t, β_t) | Block-by-block refit; each fit's parameters are an EMA of the prior fit's with tunable half-life h_param. | Sequential MLE: re-fit DC at each of K timestamps, blend with prior fit. | One: h_param. (Match-likelihood time-decay h_match stays.) | 20 matches/team total, same as baseline. | ~K× baseline (K refits). For K=8 walks, ~8×; modest. |
| (b) Kalman / state-space | Latent state θ_t = (α_t, β_t) per team, AR(1) or random-walk transition θ_{t+1} = θ_t + ε_t, ε ~ N(0, Q). Match emits Poisson goals conditional on current θ. | Extended Kalman filter (non-Gaussian emissions) or Laplace approx; smoothing pass for retrospective θ_t. | Two per team OR shared: Q (process noise) for α, β. Plus initial-state prior. | Each team's θ is a time series: needs ≥ ~10 matches across the window AND ≥ ~5 matches in the recent regime to be useful. | 5-20× baseline depending on smoother. Worse-than-linear in T × N_teams. |
| (c) Random-walk Bayesian | Joint hierarchical model: α_t(t_k) − α_t(t_{k-1}) ~ N(0, σ_α²) with σ shared across teams or partial-pooled. | NUTS / HMC sampler (Stan, PyMC, or NumPyro). Posterior over the full α_t(·) trajectory per team. | σ_α, σ_β (random-walk scales). Posteriors carry calibrated uncertainty. | Same as (b): needs trajectory coverage per team. | 50-500× baseline. NUTS on this many parameters (~200 teams × ~T timestamps × 2) is heavy. |

A few notes on the comparison:

  • (a) EMA is the operational upgrade. Re-run DC at K timestamps, blend per-team via α_t^{(k)} = (1−w) α_t^{(k−1)} + w α_t^{MLE,k} with w = 1 − 2^(−Δt/h_param). ~50 lines on top of the existing fit. Failure mode: a noisy per-walk MLE just gets smoothed.
  • (b) Kalman is the principled middle ground — proper Poisson emission noise, closed-form θ_t posterior under linear-Gaussian approximation, backward smoothing. The Poisson emission needs an EKF or Laplace approximation per step; the DC ρ correction adds further non-linearity.
  • (c) Random-walk Bayesian is the full statistical solution. Posterior trajectories per team, partial pooling of the random-walk variance (so sparse-corpus teams inherit a sensible variance from the global mean), and direct posterior predictive sampling. NUTS on 49k matches × ~270 teams is hours per chain. Publishable methodological contribution; also the variant most likely to overfit the corpus.
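The blending rule quoted for variant (a) can be sketched in a few lines. This is a minimal illustration, not the project's code: the function names, the treatment of teams that appear in only one of the two fits, and the example numbers are all assumptions made for the sketch.

```python
def ema_weight(delta_days: float, half_life_days: float) -> float:
    """Blend weight w = 1 - 2^(-dt / h_param), as in the formula for variant (a)."""
    return 1.0 - 2.0 ** (-delta_days / half_life_days)


def ema_blend(prev: dict[str, float], mle: dict[str, float],
              delta_days: float, half_life_days: float) -> dict[str, float]:
    """Blend one parameter vector (attack or defense) across two consecutive refits."""
    w = ema_weight(delta_days, half_life_days)
    blended = dict(prev)  # assumption: teams missing from the new fit carry forward
    for team, value in mle.items():
        if team in prev:
            blended[team] = (1.0 - w) * prev[team] + w * value
        else:
            blended[team] = value  # assumption: a newly fitted team starts at its MLE
    return blended


# Illustrative numbers only, not fitted values.
prev_attack = {"ARG": 0.20, "GER": 0.40}
mle_attack = {"ARG": 0.40, "GER": 0.10, "CRO": 0.05}
blended = ema_blend(prev_attack, mle_attack, delta_days=90, half_life_days=365)
```

Read literally, the formula gives w = 0 when h_param is infinite (the blend never leaves the previous value) and w close to 1 when h_param is very short (the blend is essentially the raw per-walk MLE).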

Identifiability concerns (cross-cutting):

  • The sum-to-zero (α, β) constraint is currently per-fit. A time-varying fit must decide whether to enforce it per-timestamp (cross-team scale invariant through time) or only at the first timestamp (a uniform shift in league-wide scoring can be absorbed by a global α_t drift). Prior: per-timestamp — cleaner.
  • home_advantage and ρ can in principle vary too. v0 should hold them fixed; making them vary risks calibration drift.
  • The match-likelihood time-decay (HALF_LIFE_DAYS = 5 * 365) competes with the parameter-evolution timescale. Setting both produces a double-decay that's hard to reason about. v0 should drop the match-likelihood decay in the state-space variant and let the parameter evolution be the only time-decay mechanism.
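To make the double-decay concern concrete, here is a small sketch under a strong simplifying assumption that the note itself does not state: that an old match's influence is discounted by the product of the two exponential decays, 2^(-age/h_match) × 2^(-age/h_param). Under that idealisation the combined half-life is the harmonic combination of the two. The function name is hypothetical; the h_param values are the grid from the evaluation plan.

```python
def effective_half_life(h_match: float, h_param: float) -> float:
    """Half-life of 2^(-age/h_match) * 2^(-age/h_param), in the same unit as the inputs."""
    return 1.0 / (1.0 / h_match + 1.0 / h_param)


H_MATCH_YEARS = 5.0  # HALF_LIFE_DAYS = 5 * 365, expressed in years

# Nominal grid label versus the combined half-life if both decays stay switched on.
for label, h_param in [("4y", 4.0), ("2y", 2.0), ("1y", 1.0), ("6mo", 0.5), ("3mo", 0.25)]:
    print(label, round(effective_half_life(H_MATCH_YEARS, h_param), 2))
```

Under this assumption a nominal 4y grid point behaves like roughly 2.2 years, so the grid labels would overstate the memory of the model whenever the match-likelihood decay is left on.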

Recommendation

Start with (a) EMA. Promote to (b) Kalman only if (a) clears the gate and there's appetite for a methodological writeup.

Reasoning:

  1. The cheapest path to a yes/no on the underlying hypothesis is the simplest model. If the corpus contains exploitable within-team drift, an EMA over per-walk MLE refits will pick up part of it. If the EMA doesn't move Brier, neither will the more sophisticated variants — they all chase the same signal.
  2. The walk-forward harness already produces per-walk DC refits (backtest_models.fit_models_pre_cutoff). The EMA variant reuses this infrastructure: K refits at K cutoffs, then a blend. The code surface is small.
  3. The Kalman / random-walk variants only become worth the cost if either (i) the EMA shows lift and you want to extract more, or (ii) the project decides to publish a methodological note and needs principled uncertainty quantification.
  4. There is no compliance risk to either variant — both are pure probability publication. The recommendation is purely a cost/value call.

The decision lever for the user is whether the upgrade is a one-shot ablation or a publishable methodological contribution. The first points to (a); the second points to (b) or (c). Both are defensible.

Data inventory

The corpus is documented in documentation/methodology.md. For this upgrade:

  • data/raw/intl/results.csv — 49,329 matches (martj42 mirror), date / home / away / scores / tournament / neutral flag. This is the binding constraint: the international corpus is small relative to club-football state-space studies (which routinely have 100k+ matches per league across decades).
  • data/raw/intl/xg.csv — partial xG coverage on intl matches, consumable via --use-xg in the existing fit. Coverage is well below 50% even on recent matches; a state-space fit on xG-only would be too sparse. The state-space variant should default to goals as the response (matching baseline DC).
  • data/wc2026/intl_elo_history.csv — rolled-forward Elo per team per match date. Not used by DC directly but available as a state-space initialisation prior if needed (centre each team's α_t(t_0) on its Elo at t_0).
  • Per-team match counts in the 5-year window — ~50-80 matches for major federations (BRA, ARG, ESP, FRA, GER, ENG, …), 15-40 for mid-tier (CMR, NGA, USA, KOR, JPN, …), 5-15 for small federations. The MIN_MATCHES_PER_TEAM = 20 floor already trims the tail; the state-space variant likely needs to raise this to 30 or 40, since fitting a trajectory needs more data than fitting a single point estimate.

What's missing: the corpus does not include continuous match-level covariates that would help disentangle drift from noise (e.g. roster-quality time series, ELO-of-opposition time series). These could be added as fixed-effect controls in a Bayesian variant but aren't in the pipeline today.

Bottleneck: for variants (b) and (c), the binding constraint is per-team trajectory coverage, not total corpus size. A team like São Tomé and Príncipe has 8 matches in the 5-year window; there's literally not enough signal to fit a 2D trajectory. The state-space fit needs to either (i) raise the min-matches floor (losing ~50 teams that the baseline currently fits), or (ii) partial-pool the random-walk variance so sparse teams effectively inherit a near-stationary trajectory. The Bayesian variant (c) gets (ii) for free; the Kalman variant (b) needs explicit handling.
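A scalar sketch may help show why the process noise is the knob that decides how "near-stationary" a trajectory is in variant (b). It is a single predict-and-update step of an extended Kalman filter for one team's attack term with a Poisson emission. Everything here is an assumption made for illustration: the function name, treating the rest of the log-rate as a fixed offset, and ignoring the DC ρ correction entirely.

```python
import math


def ekf_poisson_step(mean: float, var: float, goals: int,
                     offset: float, q: float) -> tuple[float, float]:
    """One random-walk predict + EKF update for a scalar log-rate state.

    mean, var : current estimate of the attack term and its variance
    goals     : goals the team scored in this match
    offset    : the rest of the log-rate (opponent defense, home term), held fixed
    q         : process-noise variance accumulated since the previous match
    """
    var_pred = var + q                        # random-walk transition inflates the variance
    lam = math.exp(mean + offset)             # expected goals at the predicted state
    gain = var_pred / (lam * var_pred + 1.0)  # EKF gain: Jacobian lam, observation variance lam
    new_mean = mean + gain * (goals - lam)
    new_var = (1.0 - gain * lam) * var_pred
    return new_mean, new_var


# Same match, two process-noise settings: q = 0 keeps the state close to stationary,
# a larger q lets a single result move the estimate further.
stationary = ekf_poisson_step(mean=0.0, var=0.1, goals=3, offset=0.0, q=0.0)
drifting = ekf_poisson_step(mean=0.0, var=0.1, goals=3, offset=0.0, q=0.05)
```

In this framing, one form the explicit handling for sparse teams could take is a small q for teams below some match count, though how q should be set is exactly what the design leaves open.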

Evaluation plan

Mirror the conjunction-gate format used by scripts/calibrate_gk_offset.py and scripts/backtest_composite_offset.py, since the methodology page and the audit doc both reference it.

| Field | Value |
|---|---|
| Walks | 8 |
| Window per walk | 90 days |
| Most recent walk | 2026-02-24 → 2026-05-25 |
| Earliest walk | 2024-06-04 → 2024-09-02 |
| Per-walk fit | EMA-DC + HP refit on pre-cutoff matches (HP held stationary as baseline; only DC swaps) |
| Tunable | h_param half-life on parameter evolution, grid {∞ (baseline), 4y, 2y, 1y, 6mo, 3mo} |
| Comparison surface | DC component standalone (raw Brier / log-loss / ECE) AND DC's contribution to the uniform ensemble |
| Acceptance gate | conjunction: median Brier strictly lower than h_param = ∞ baseline AND median ECE within +0.2pp |
| Monotone check | as in gk-offset-8walk-confirm.md: improvement should be monotone or at least non-reversing across the grid; a one-point peak with degraded neighbours is a red flag for overfit |
| Tournament-only ECE | reported separately on the K ≥ 50 slice (per tournament-only-backtest.md); the WC predict path is the one that matters most |
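The walk boundaries in the table are consistent with eight back-to-back 90-day windows ending on 2026-05-25. A short sketch that reproduces them is below; the function name is hypothetical, and which window a match played exactly on a shared boundary date belongs to is not specified in this note.

```python
from datetime import date, timedelta


def walk_windows(last_end: date, n_walks: int = 8,
                 window_days: int = 90) -> list[tuple[date, date]]:
    """Return (start, end) pairs, earliest first, for contiguous windows ending at last_end."""
    windows = []
    end = last_end
    for _ in range(n_walks):
        start = end - timedelta(days=window_days)
        windows.append((start, end))
        end = start  # the next (earlier) window ends where this one starts
    return windows[::-1]


windows = walk_windows(date(2026, 5, 25))
for start, end in windows:
    print(start.isoformat(), "->", end.isoformat())
```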

Per-walk reporting follows the shipped notes' table format:

| h_param | DC median Brier | DC ensemble median Brier | Δ vs baseline (median) | ECE delta |
|---:|---:|---:|---:|---:|
| ∞ (baseline)   | … | … | — | — |
| 4y             | … | … | … | … |
| 2y             | … | … | … | … |
| 1y             | … | … | … | … |
| 6mo            | … | … | … | … |
| 3mo            | … | … | … | … |

Shipping thresholds (mirrored from the GK-offset note):

  • A Brier improvement of ≥ 0.5 bp (median, across 8 walks) is the rough floor of what's distinguishable from noise on a corpus of this size. Below that, the rest-day ablation precedent says "don't ship".
  • ECE within ±0.2pp of baseline is the calibration half of the conjunction. The state-space variant could plausibly help here too (better-calibrated late-window probabilities), but the gate doesn't reward ECE improvement; it only forbids ECE regression.
  • Tournament-only ECE within ±0.5pp (looser, since n is smaller and the metric is noisier on 100-200 tournament matches per walk).
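The conjunction gate and the monotone check can be written down as two small predicates. This is a sketch under stated assumptions rather than the harness's actual logic: it reads "median Brier lower than baseline" as a comparison of the two medians across walks (not the median of per-walk deltas), takes one basis point of Brier to be 0.0001, and treats the ECE condition as one-sided. All names are hypothetical.

```python
from statistics import median

BP = 1e-4  # assumption: one basis point of Brier = 0.0001
PP = 1e-2  # one percentage point of ECE = 0.01


def passes_gate(cand_brier, base_brier, cand_ece, base_ece,
                min_lift=0.5 * BP, max_ece_regress=0.2 * PP) -> bool:
    """Conjunction: Brier lift at or above the floor AND ECE regression within tolerance."""
    lift = median(base_brier) - median(cand_brier)
    ece_delta = median(cand_ece) - median(base_ece)
    return lift > 0 and lift >= min_lift and ece_delta <= max_ece_regress


def isolated_peak(lifts) -> bool:
    """Red flag: the best grid point shows lift but its immediate neighbours show none."""
    best = max(range(len(lifts)), key=lifts.__getitem__)
    neighbours = [lifts[i] for i in (best - 1, best + 1) if 0 <= i < len(lifts)]
    return lifts[best] > 0 and all(n <= 0 for n in neighbours)


# Illustrative per-walk numbers only (8 walks), not backtest results.
base_brier, cand_brier = [0.2000] * 8, [0.1990] * 8
base_ece, cand_ece = [0.010] * 8, [0.011] * 8
ok = passes_gate(cand_brier, base_brier, cand_ece, base_ece)
```

Keeping the two checks separate means a grid point can pass the gate and still be flagged for review when its neighbours degrade.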

Negative result format: if the gate fails on every h_param, the note ships as state-space-dc-ablation.md with the same skeleton as rest-day-ablation.md — clear failure-mode discussion, what the backtest can and cannot tell you, no code merged to production.

Scope estimate

Honest range for the (a) EMA variant, assuming the recommended path:

| Phase | Days | Notes |
|---|---:|---|
| Data prep / harness wiring | 1 | Mostly reuses backtest_models.fit_models_pre_cutoff. The per-walk refit loop already exists. |
| EMA blending logic + min-matches re-floor | 1 | The blending step is ~50 lines. Re-tuning MIN_MATCHES_PER_TEAM and verifying no major federations drop out is the more careful part. |
| Walk-forward gate run (full grid × 8 walks) | 1-2 | Each walk's DC refit takes ~2-3 minutes; 6 h_param values × 8 walks ≈ 2 hours of CPU per run. Expect 2-3 runs (initial, after iterating on bugs, final). |
| Reliability diagram + tournament-only slice | 0.5 | Pulls from scripts/backtest_models.py --tournaments-only. |
| Writeup | 1-1.5 | The note itself (≈ 1500-3000 words, mirror format), plus the web/public/research/notes/ mirror copy, plus methodology-page row if it ships. |
| Total for (a) | 4.5-6 days | One-week scope if everything goes well, ten days if there's an inconvenient bug. |

For (b) Kalman, multiply by ~3-4× — the smoother + EKF implementation is a several-day project on its own, the diagnostics are more demanding, and the writeup needs more methodological care (prior choice, identifiability discussion, computational reproducer). ~3 weeks.

For (c) Random-walk Bayesian, multiply by ~6-8× — building the PyMC / NumPyro model, validating chain convergence, running posterior predictive checks on the walk-forward gate. ~4-6 weeks, and that's assuming Stan-equivalent tooling is in working order on the dev host.

Risks

  1. Overfitting on per-team trends. A 6-month h_param over an 8-walk gate effectively fits 8 separate DC models per team. With ~270 teams and 49k matches in the window, that's ~22 matches per team per walk on average, which is fine for the major federations and far too few for the tail. The EMA partially mitigates this (smoothing across walks), but the Bayesian variant's partial pooling is the only formal protection.
  2. Sparse-team behaviour. Federations with ≤ 30 matches in 5 years have effectively no trajectory to fit. The state-space variant needs an explicit fallback path: either re-raise MIN_MATCHES_PER_TEAM (losing coverage) or fall back to a stationary fit per team (special-cased in the predict path). The current MIN_MATCHES of 20 is already tight for a 2-parameter-per-team model; a 2-parameter-trajectory model needs more.
  3. Computational cost compounds with grid size. The proposed h_param grid is 6 values × 8 walks = 48 DC refits. Each refit is 2-3 minutes on the current host; the total is 1.5-2.5 CPU-hours per run. (a) EMA stays within this envelope; (b) Kalman pushes it to 8-12 hours per run; (c) Bayesian to 50-100+ hours. The walk-forward gate is the bottleneck for all three.
  4. Ensemble interaction. The current ensemble (uniform mean of Elo, DC, HP; per-class isotonic calibrator on the result) was tuned with stationary DC. Swapping DC for a state-space variant may shift the DC component's contribution enough that the calibrator's per-tier curves need a re-fit — and the per-class isotonic from fit_ensemble_calibrator.py is the surface that matters for the published probabilities. Plan: gate the swap on both raw-ensemble Brier AND calibrated-ensemble Brier; if the two diverge, hold the ship until the calibrator is re-fit.
  5. Identifiability drift across timestamps. A per-timestamp sum-to-zero constraint on (α, β) is the clean choice, but it means a uniform shift in the league's offensive level (e.g. a 2020-era post-COVID scoring dip) gets attributed to noise rather than absorbed by a global α_t shift. Worth pre-registering the choice and noting in the writeup.
  6. Backwards compatibility with downstream callers. fit_dixon_coles.json is read by ensemble.py, backtest_composite_offset.py, predict_match, and the dashboard. A state-space fit produces a time-indexed α_t(t) and β_t(t), not a single value per team. The artefact format needs to either (i) carry the full trajectory and downstream callers learn to pull the most-recent timestep, or (ii) snapshot the current-time values into the legacy slots and carry the trajectory as a separate block. Option (ii) is cleaner for the v0 ship and preserves the "downstream callers that only read attack/defense keep working unchanged" guarantee that the existing code documents. Worth doing.
  7. Time-decay double-counting. As noted under the identifiability concerns above, the existing HALF_LIFE_DAYS = 5 * 365 competes with the state-space evolution timescale. The v0 form should drop the match-likelihood time-decay in the state-space variant; if it doesn't, the effective decay is the minimum of the two and the gate's h_param grid is misleading. Pre-register the choice in the design note before running the backtest.
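For the artefact-format risk (item 6), option (ii) could look roughly like the sketch below: the most recent values are snapshotted into the legacy slots and the full trajectory rides along as a separate block. The attack and defense slot names follow the wording of the backwards-compatibility guarantee quoted above; the top-level layout, the "trajectory" key, the ISO-date keys, and the function name are all assumptions for illustration, not the actual JSON schema.

```python
import json


def build_artefact(trajectory: dict[str, dict[str, dict[str, float]]]) -> dict:
    """Snapshot the latest values into legacy slots; keep the trajectory as a sidecar block.

    trajectory[team][iso_date] = {"attack": ..., "defense": ...}
    """
    attack, defense = {}, {}
    for team, series in trajectory.items():
        latest = max(series)  # ISO-8601 date strings sort chronologically
        attack[team] = series[latest]["attack"]
        defense[team] = series[latest]["defense"]
    return {"attack": attack, "defense": defense, "trajectory": trajectory}


# Illustrative values only.
trajectory = {
    "ARG": {
        "2025-01-01": {"attack": 0.30, "defense": -0.10},
        "2026-01-01": {"attack": 0.50, "defense": -0.20},
    },
}
artefact = build_artefact(trajectory)
payload = json.dumps(artefact, indent=2)
```

A reader that only looks up the attack and defense entries sees the same shape as before, while trajectory-aware callers can opt in to the extra block.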

Open questions for the user

  1. One-shot ablation or methodological contribution? The primary lever. "Does the hypothesis pay off on Brier?" → (a) EMA, 4-6 days. "Publishable state-space note alongside Dixon-Coles 1997 / Baio-Blangiardo 2010?" → (c) Bayesian, multi-week.
  2. h_param grid: discrete or continuous? Shipped notes use small discrete grids with a monotone-across-grid check baked into the gate (gk-offset-8walk-confirm.md). A continuous search (golden-section) is more precise but loses the diagnostic. Prior: discrete.
  3. Floor on per-team match count? Current MIN_MATCHES is 20 for the stationary fit. Trajectory fitting needs more — 30? 40? Higher is safer but loses coverage on smaller federations.
  4. Should home_advantage and ρ also be time-varying? Prior: no, hold them fixed in v0. But the literature notes reduced home-effect in post-COVID crowd-less matches — there's a real story that says they should drift too. Worth surfacing.
  5. Drop the match-likelihood time-decay in the state-space variant? Prior: yes, let parameter evolution be the only time-decay mechanism. Alternative: keep both as a stacked smoother. Pre-register before running the gate.
  6. Artefact format — trajectory in legacy slots, or sidecar? Prior: sidecar (preserves the backwards-compat guarantee that downstream readers of attack/defense keep working). User may prefer the cleaner long-term option: bake trajectory into dixon_coles.json and bump the format version.
  7. If (a) EMA fails the gate — stop, or proceed to (b)? Prior: stop, by the rest-day precedent. But there are reasons (b) could succeed where (a) doesn't (proper uncertainty, partial pooling). Pre-register a stop rule so the project doesn't chase the upgrade indefinitely.

This note is a design only. No code has been written, no fit has been run, no decision has been taken. The next action is the user's call on questions 1, 2, and 4-7 above.