Retuning the models for tournament football — what changed

Status: Not shipped (no production change). Two of the four candidate changes were already in production, the composite_alpha refit cleared the gate's strict-less-than condition but missed the practical-significance bar, and the meta-learner refit was not run.
Backtest date: 2026-05-23
Reproducer: scripts/backtest_composite_offset.py --folds 1 --window-days 1825 --tournaments-only --today 2026-05-22 --output data/wc2026/composite_offset_backtest_tournaments.json
Persisted output: data/wc2026/composite_offset_backtest_tournaments.json (gitignored; re-run reproduces).
Related: documentation/research-notes/tournament-only-backtest.md, documentation/methodology.md

Why this exists

PR #310 documented that all four models in the ensemble are ~7% worse on tournament matches than on the all-matches average. The natural follow-up is to refit the predict-time knobs on a tournament-only training slice and serve tournament-variant artefacts at WC fixture predict time.

The three targets in scope were:

composite_alpha — the player-aware DC offset coefficient that turns a Δ in starting-XI composite into a Δ in log-λ. PR #305 calibrated it on all-matches at α = 0.05.
Ensemble meta-learner — the gradient-boosted classifier (scripts/fit_ensemble_meta.py) that learns ensemble weights from data.
Isotonic calibrator — the per-class PAV curves (scripts/fit_ensemble_calibrator.py) that calibrate raw ensemble probabilities to observed frequencies.

The hypothesis: tournament matches reward team depth and calibrate differently from friendlies; tournament-variant artefacts ought to improve Brier on the matches that actually matter.

This note reports the audit (what's already wired) plus the one new ablation that was worth running.

What was already shipped

Tier-aware calibration. scripts/fit_ensemble_calibrator.py already fits curves_by_tier alongside the pooled curves, with tournament, qualifier, and friendly as the three tiers. ensemble.py branches on the match's tournament string at predict time (via calibrator_tier(...)) and applies the tournament-tier curve when the match is a major tournament. The acceptance gate at fit time requires the tier-aware ECE to strictly beat pooled ECE on the tournament slice — failing the gate falls back to identity (no-op).
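The fit-time gate described above can be sketched as follows. This is a minimal illustration, not the code in scripts/fit_ensemble_calibrator.py: the function names, the 10-bin ECE definition, and the string return values are all assumptions made for the example.

```python
def expected_calibration_error(probs, outcomes, n_bins=10):
    """Binned ECE for one class: weighted mean of |avg predicted - observed frequency|."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 lands in the last bin
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        freq = sum(y for _, y in members) / len(members)
        ece += len(members) / total * abs(mean_p - freq)
    return ece


def choose_tournament_curve(ece_tier, ece_pooled):
    """Strict gate: use the tier curve only if it beats pooled; a tie falls back to identity."""
    return "tier" if ece_tier < ece_pooled else "identity"
```

Because the comparison is strict, a tier curve that merely matches the pooled ECE on the tournament slice is rejected.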

There was no tournament-variant artefact to ship — the production calibrator already carries it.

Injury → team strength. PR #305 (player-aware DC offset) reads the probable XI from predicted_squads.json, sums the per-player composite ratings into team_composite_sum.csv, and applies α × (composite_h − composite_a) to log-λ at predict time. Critically, export_predicted_squads.py filters out players with availability_status == "out" (D16's injury classification) before the probable XI is published. The injury signal therefore flows through the composite chain into λ without any additional plumbing.
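A minimal sketch of how the injury filter feeds the composite sum is shown below. Only the availability_status field and its "out" value come from the pipeline described above; the dictionary layout, the "composite" key, and the rule of taking the first eleven available players in preference order are assumptions for illustration.

```python
def probable_xi(squad, size=11):
    """Drop players marked "out", then keep the first `size` in preference order."""
    available = [p for p in squad if p["availability_status"] != "out"]
    return available[:size]


def composite_sum(xi):
    """Sum the per-player composite ratings over the selected XI."""
    return sum(p["composite"] for p in xi)


# Hypothetical squad: one injured star followed by eleven fit players.
squad = [{"name": "star", "composite": 0.85, "availability_status": "out"}]
squad += [
    {"name": f"player_{i}", "composite": 0.5, "availability_status": "fit"}
    for i in range(11)
]
xi = probable_xi(squad)
team_total = composite_sum(xi)  # the injured star contributes nothing
```

The point of the sketch is that no separate injury feature is needed: once the injured player is excluded from the XI, the lower team total is what the offset sees.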

When France is missing Mbappé, Saliba and Kanté, France's sum_composite drops by approximately 2.5 (each star contributes ~0.85 to the XI composite). At the production α = 0.05 this multiplies France's λ_attack by exp(−0.05 × 2.5) ≈ 0.88, i.e. France's expected goals fall by ~12% relative to a full-strength baseline. A 12% λ shift is a meaningful per-fixture P(H/D/A) movement, not a nominal flag.
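The arithmetic can be checked with a few lines of Python. This is a sketch under one assumption: a fall in a team's own composite enters that team's log-λ with a negative sign. How the production code splits the offset between the two sides is not shown in this note, and the function name is hypothetical.

```python
import math


def lambda_multiplier(alpha, composite_delta):
    """Multiplicative change in λ when alpha * composite_delta is added to log-λ."""
    return math.exp(alpha * composite_delta)


# A composite that falls by 2.5 at the production alpha of 0.05.
depleted = lambda_multiplier(0.05, -2.5)
percent_drop = (1.0 - depleted) * 100.0

print(f"multiplier = {depleted:.3f}, expected goals down {percent_drop:.1f}%")
```

The multiplier comes out at about 0.88, so a 2.5-point drop in composite costs roughly 12% of expected goals.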

The implicit claim in this note's preceding research ("the team-strength models still don't know who's playing") was wrong. They do.

What was newly ablated: tournament-only `composite_alpha`

Setup. Single-fold walk-forward at --folds 1 --window-days 1825 --today 2026-05-22. Fit cutoff 2021-05-23. DC trained on 8,632 pre-cutoff matches; HP trained on 9,029. Evaluation holdout = matches in (2021-05-23, 2026-05-22], filtered to k_factor ≥ 50 (FIFA WC, Euro, Copa, AFCON, AFC Asian Cup, Gold Cup, CONCACAF, Confeds): 881 matches, 797 in the common subset every component reached.
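The cutoff date and the holdout filter follow directly from the flags. The sketch below reproduces them with the standard library; the match dictionary layout (the "date" and "k_factor" keys) and the function names are assumptions, not the structures used in scripts/backtest_composite_offset.py.

```python
from datetime import date, timedelta


def fit_cutoff(today, window_days):
    """Last date included in training: `window_days` before `today`."""
    return today - timedelta(days=window_days)


def tournament_holdout(matches, today, window_days, min_k=50):
    """Matches in (cutoff, today] whose k_factor marks a major tournament."""
    cutoff = fit_cutoff(today, window_days)
    return [
        m for m in matches
        if cutoff < m["date"] <= today and m["k_factor"] >= min_k
    ]


cutoff = fit_cutoff(date(2026, 5, 22), 1825)  # date(2021, 5, 23)
```

The window is 1825 days rather than five calendar years, which is why the cutoff lands on 2021-05-23 and not 2021-05-22: the span includes 29 February 2024.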

α grid: {0.0, 0.005, 0.01, 0.02, 0.05}.

Result

α	DC Brier	HP Brier	DC+HP avg Brier	Δ vs α=0
0.0	0.5461	0.5499	0.5466	—
0.005	0.5461	0.5498	0.5465	−0.0001
0.01	0.5461	0.5497	0.5465	−0.0001
0.02	0.5461	0.5495	0.5464	−0.0002
0.05	0.5468	0.5496	0.5468	+0.0002

The script's built-in gate ("median Brier strictly lower than baseline") passes at α = 0.02. The practical-significance gate this note's authors had in mind (Brier improvement ≥ 0.001, mirroring the rest-day ablation note) does not pass: the best improvement is 0.000156, about six times smaller than what we'd require to ship a model change.
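The two gates can be compared side by side on the averaged Brier column of the table. This is an illustrative sketch, not the script's gate code: the function and constant names are hypothetical, and the inputs are the four-decimal table values, so the gain it reports is the rounded 0.0002 and not the unrounded 0.000156.

```python
PRACTICAL_MIN_GAIN = 0.001  # threshold borrowed from the rest-day ablation note

# DC+HP avg Brier by alpha, as rounded in the result table.
avg_brier = {0.0: 0.5466, 0.005: 0.5465, 0.01: 0.5465, 0.02: 0.5464, 0.05: 0.5468}


def evaluate_gates(brier_by_alpha, baseline_alpha=0.0, min_gain=PRACTICAL_MIN_GAIN):
    """Return the best alpha and whether it clears the strict and practical gates."""
    baseline = brier_by_alpha[baseline_alpha]
    best_alpha = min(brier_by_alpha, key=brier_by_alpha.get)
    gain = baseline - brier_by_alpha[best_alpha]
    return {
        "best_alpha": best_alpha,
        "gain": gain,
        "strict_pass": gain > 0,             # strictly lower Brier than baseline
        "practical_pass": gain >= min_gain,  # large enough to justify shipping
    }


result = evaluate_gates(avg_brier)
```

With a single fold, the median across folds is just that fold's value, so the strict gate reduces to the plain comparison used here.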

Counter to hypothesis

The hypothesis going in was that tournament matches reward composite-XI depth more than friendlies — so the tournament-only α should be larger than the all-matches α (0.05). The data say the opposite: the tournament-only optimum is smaller (0.02), and even at the optimum the contribution to Brier is barely measurable.

Two plausible explanations:

(1) Tournament squads are uniform on depth. All 26-man tournament rosters carry significant backups. The "playmaker absent" effect that an XI-composite delta is supposed to capture is muted on the tournament slice because the backups are also good. Friendlies, by contrast, see weaker rotation squads where the XI quality varies more match-to-match.
(2) Sample-size noise dominates. 797 common-subset matches is enough to spot a 0.03 Brier delta between models (PR #310's headline finding) but not enough to discriminate between α = 0.01 and α = 0.02. The "winning" α at 0.02 may be a sampling artifact rather than a real preference.

Either explanation is consistent with the data. The note's authors find (1) more plausible, but (2) cannot be ruled out without bootstrap CIs across many folds (the ~30 min × 8 folds = 4 hr walk-forward we deliberately did not run).
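For reference, a percentile bootstrap on per-match Brier differences looks roughly like the sketch below. It is a cheaper, within-fold variant and an assumption on our part: it resamples matches inside one fold and so does not replace the multi-fold walk-forward mentioned above. The function name and defaults are hypothetical.

```python
import random


def bootstrap_ci(per_match_diff, n_resamples=2000, level=0.95, seed=0):
    """Percentile CI for the mean of paired per-match Brier differences."""
    rng = random.Random(seed)
    n = len(per_match_diff)
    means = []
    for _ in range(n_resamples):
        # Resample matches with replacement and record the mean difference.
        sample = [per_match_diff[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]
```

Here per_match_diff would hold Brier(α = 0) minus Brier(α = 0.02) for each common-subset match. An interval that straddles zero would support the sampling-noise reading.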

What was NOT ablated: tournament-only meta-learner

The third candidate, refitting scripts/fit_ensemble_meta.py on the tournament-only slice, was scoped but not run. The reasoning:

The meta-learner is a HistGradientBoostingClassifier with ~8 features. Training on ~800 tournament-only matches across 5 years (vs the production ~5,100 all-matches) is a 6× reduction in training data. The classifier's per-feature variance scales roughly inversely with sample size; the learned weights would be high-variance enough that the "best" tournament-only weights probably differ from the all-matches weights by more than the underlying signal they're meant to capture.
The pattern from the composite_alpha refit just above — smaller-than-hypothesised optima, marginal improvements — is exactly what data-thinned model refits look like. The prior for the meta-learner refit producing a shippable result was already low; the α refit lowered it further.
The compute (~30 min walk-forward) and the engineering (predict-time branching, variant artefact serialisation, separate calibrator) would only have been worth it for a real lift.
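The data-thinning argument can be made concrete with a back-of-envelope calculation. It assumes the idealised 1/n variance scaling stated above, which a gradient-boosted classifier only follows approximately; the variable names are illustrative.

```python
import math

n_all = 5100        # approximate production training set (all matches)
n_tournament = 800  # approximate tournament-only training set

# Under 1/n scaling, variance grows by the ratio of sample sizes,
# and standard error by its square root.
variance_ratio = n_all / n_tournament
std_error_ratio = math.sqrt(variance_ratio)

print(f"variance x{variance_ratio:.1f}, standard error x{std_error_ratio:.1f}")
```

On these numbers the tournament-only estimates would be about 2.5 times noisier in standard-error terms, which is the scale of signal a tournament-specific effect would need to exceed.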

If a future cycle wants to revisit, the right preconditions would be: (a) more tournament training data (the 2026 WC adds 104 matches under the 48-team format; over a 2030 cycle the tournament corpus grows by ~25%), (b) bootstrap CIs on the existing α refit to confirm or rule out the sampling-noise explanation, or (c) a tournament-tier feature on the existing all-matches meta-learner (single-model architecture, no data-thinning) before splitting into variants.

Decision

Refit	Conclusion
Injury → team strength	Already in production (PR #305 + predicted-XI chain). No change.
Tournament-tier calibrator	Already in production (`curves_by_tier`). No change.
`composite_alpha` tournament refit	Improvement below practical-significance gate. No change.
Meta-learner tournament refit	Not run. Data-thinning prior + compute cost. No change.

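Verdict: do not ship a tournament-variant model. Two of the four candidate changes (injury → team strength and the tournament-tier calibrator) are already wired in production under different names, the composite_alpha refit doesn't clear the practical-significance bar, and the meta-learner refit was not run.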

Files touched

scripts/backtest_composite_offset.py — --tournaments-only flag, --output flag, tournaments_only= kwarg threaded through run_grid_search, payload tagged with tournaments_only.
documentation/research-notes/tournament-variant-refits-audit.md — this file.
web/public/research/notes/tournament-variant-refits-audit.md — mirror.

No production model output changed. No composite_alpha / dixon_coles.json / hierarchical_poisson.json patches were applied; the existing all-matches α = 0.05 stays in place.

Reproducing

.venv/bin/python scripts/backtest_composite_offset.py \
  --folds 1 --window-days 1825 --today 2026-05-22 \
  --tournaments-only \
  --output data/wc2026/composite_offset_backtest_tournaments.json

(Requires data/raw/intl/results.csv from python scripts/pull_intl_results.py and the existing data/wc2026/team_composite_sum.csv from python scripts/build_team_composite.py.)
