The current World Cup prediction model ships nine features on top of the base three-component ensemble. Getting to those nine required testing nineteen. Ten didn't survive the gate. This post is about the ten.
Why publish the rejections
Most prediction models tell you what they do. Few tell you what they tried and stopped doing. That asymmetry makes models look more deliberate than they are — as if every shipped feature was an obvious good idea from the start.
In practice, model development is hypothesis → test → usually no. Publishing the no-ships does three things:
- Calibrates trust. If we only publish the wins, the model looks overfit to narrative. Showing the losses gives you the real hit rate.
- Prevents re-invention. Anyone building on this work (including future versions of ourselves) can see what was already tested and why it failed.
- Documents the gate. Every rejection below was judged by the same 8-fold walk-forward Brier + ECE gate used for the shipped features. The gate is the story, not any individual result.
The gate
Every variant is tested against the same protocol:
- 8 × 90-day walk-forward folds covering June 2024 – May 2026
- Primary metric: Brier score (lower is better — measures calibration + discrimination jointly)
- Secondary metric: Expected Calibration Error (ECE) — must not regress, even if Brier improves
- Conjunction rule: a variant must improve Brier AND not worsen ECE to ship
The tournament-tier subset (~70 matches per fold on average) is the binding surface. Friendlies and qualifiers carry less weight because the tournament is what we're forecasting.
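A minimal sketch of how a gate like this could be scored. It assumes a three-outcome (home/draw/away) multiclass Brier score, a top-class ECE with ten equal-width bins, and a gate that compares already-aggregated numbers; the function names and the bin count are illustrative, not taken from the project's code.

```python
from typing import Sequence


def brier_score(probs: Sequence[Sequence[float]], outcomes: Sequence[int]) -> float:
    """Multiclass Brier: mean over matches of the squared error summed over outcomes."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        total += sum((p_k - (1.0 if k == y else 0.0)) ** 2 for k, p_k in enumerate(p))
    return total / len(outcomes)


def expected_calibration_error(
    probs: Sequence[Sequence[float]], outcomes: Sequence[int], n_bins: int = 10
) -> float:
    """Top-class ECE: bin by confidence, then weight each bin's |accuracy - confidence|."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        conf = max(p)
        pred = list(p).index(conf)
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, 1.0 if pred == y else 0.0))
    n = len(outcomes)
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            accuracy = sum(hit for _, hit in b) / len(b)
            ece += len(b) / n * abs(accuracy - avg_conf)
    return ece


def passes_gate(base_brier: float, base_ece: float,
                cand_brier: float, cand_ece: float) -> bool:
    """Conjunction rule: Brier must improve AND ECE must not get worse."""
    return cand_brier < base_brier and cand_ece <= base_ece


# Hypothetical numbers: Brier improves slightly but ECE worsens, so no ship.
print(passes_gate(0.5720, 0.107, 0.5717, 0.119))  # False
```

The conjunction is deliberately asymmetric: Brier has to move in the right direction, while ECE only has to hold still.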
The ten rejections
1. Rest-day adjustment
Hypothesis: Teams with fewer rest days between matches perform worse. Adjusting expected goals for rest differential should improve predictions.
Result: The effect is statistically significant (p ≈ 10⁻¹⁰⁸; not a typo). But the practical lift is a negligible -0.00016 Brier, and ECE regressed by +0.27 percentage points, so the variant fails the conjunction rule. Dixon-Coles already absorbs the rest-day effect through its time-decay weighting of recent results, so adding a separate rest-day term double-counts.
Verdict: Real effect, already captured. No ship.
2. State-space (time-varying) Dixon-Coles
Hypothesis: Team strength changes over time. A state-space model with exponential moving average (EMA) over per-walk MLE refits should track these changes and improve on the static DC fit.
Result: Every half-life tested (180 days to 2,880 days) degraded Brier by +31 to +67 basis points. The international match corpus is too sparse for time-varying parameters to help — the EMA adds noise faster than it tracks signal. Where club-level models (thousands of matches per season) benefit from time-varying attack/defence, international football (~150 matches per year across all teams) doesn't have the density.
Verdict: Fundamentally mismatched to the data regime. No ship. Kalman and Bayesian variants are parked for post-corpus-expansion review.
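To make the half-life knob concrete, here is a rough sketch of an EMA over successive refit estimates of a single team parameter. The half-life-to-weight conversion is the standard one; the 90-day spacing between refits and the function names are assumptions for illustration, not the project's implementation.

```python
from typing import List


def ema_weight(dt_days: float, half_life_days: float) -> float:
    """Weight on the newest refit; the old state's weight halves every half_life_days."""
    return 1.0 - 0.5 ** (dt_days / half_life_days)


def smooth_refits(refits: List[float], dt_days: float, half_life_days: float) -> List[float]:
    """Exponentially smooth a sequence of per-walk parameter estimates."""
    w = ema_weight(dt_days, half_life_days)
    state = refits[0]
    smoothed = [state]
    for fit in refits[1:]:
        state = (1.0 - w) * state + w * fit
        smoothed.append(state)
    return smoothed


# Assumed 90-day spacing between refits (hypothetical).
short = ema_weight(90, 180)    # shortest half-life tested: reacts quickly
long = ema_weight(90, 2880)    # longest half-life tested: barely moves
print(short > long)  # True
```

A short half-life chases each new refit, which on a sparse corpus means chasing noise; a very long one barely departs from the static fit.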
3. Style-matchup pair effects
Hypothesis: Some tactical styles match up better against others. An 8×8 tactical fingerprint grid (built from StatsBomb style vectors) should capture pair-specific effects that the team-level models miss.
Result: Optimal shrinkage was 0.95+ — meaning the prior (no style effect) dominates. The pair-effect matrix can't be estimated reliably from international data because most team pairs meet rarely, and squad composition changes between encounters. Both the strict and content gates failed.
Verdict: Insufficient data to learn pair effects. No ship.
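As a rough illustration of what a shrinkage of 0.95 means, assuming it is the weight placed on the no-effect prior (the function name and the raw effect value are hypothetical):

```python
def shrunk_pair_effect(raw_effect: float, shrinkage: float, prior: float = 0.0) -> float:
    """Blend a raw pair-effect estimate toward the prior; shrinkage is the prior's weight."""
    return shrinkage * prior + (1.0 - shrinkage) * raw_effect


# With 0.95 shrinkage, only 5% of a (made-up) raw effect of 0.20 survives.
effect = shrunk_pair_effect(0.20, 0.95)
print(round(effect, 4))  # 0.01
```

When the optimiser prefers to throw away 95% or more of every estimate, the grid is contributing almost nothing beyond the team-level models.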
4. Per-tier ensemble weights
Hypothesis: The three component models might perform differently on tournament matches versus friendlies. Learning tier-specific ensemble weights (instead of uniform 1/3 each) should improve calibration.
Result: The tournament-tier weights shifted heavily toward Elo (0.70 Elo / 0.25 DC / 0.05 HP), and per-tier Brier improved. But overall Brier regressed by +57 basis points because the tier-specific weights overfit to each tier's idiosyncrasies. The uniform average is more robust precisely because it doesn't try to be clever.
Verdict: Overfitting to tier boundaries. No ship.
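For reference, a sketch of the two weighting schemes side by side. It assumes the ensemble is a linear average of the components' outcome probabilities, which the post does not state, and the component probabilities below are made up.

```python
from typing import List, Sequence


def blend(component_probs: Sequence[Sequence[float]], weights: Sequence[float]) -> List[float]:
    """Weighted linear average of [home, draw, away] probabilities across models."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    n_outcomes = len(component_probs[0])
    return [
        sum(w * p[k] for w, p in zip(weights, component_probs))
        for k in range(n_outcomes)
    ]


# Hypothetical component outputs, ordered Elo, DC, HP.
components = [[0.6, 0.2, 0.2], [0.3, 0.4, 0.3], [0.3, 0.3, 0.4]]

uniform = blend(components, [1 / 3, 1 / 3, 1 / 3])   # the shipped scheme
tier = blend(components, [0.70, 0.25, 0.05])         # the rejected tournament-tier weights
```

With weights this lopsided the tournament forecast is close to Elo alone, which is the behaviour the overall Brier regression penalised.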
5. Hidden motivations (must-win / dead-rubber)
Hypothesis: Teams in "must-win" situations (needing a result to advance) play differently from teams in "dead-rubber" situations (already eliminated or qualified). A motivation state variable should improve group-stage predictions.
Result: Motivation states are almost perfectly collinear with matchday (Matchday 3 is where dead rubbers and must-wins cluster). After controlling for that confounder, zero residual motivation effect remains in the 830-match group-stage corpus. The apparent signal was entirely driven by matchday position.
Verdict: Confounded. No signal after control. No ship.
6. Per-matchday draw factors
Hypothesis: Draw rates vary by matchday (Matchday 1 produces more draws than Matchday 3, where teams need results). Matchday-specific draw factors should improve on the flat group-stage draw factor.
Result: Improvement of 0.0010 Brier, but the confidence interval includes zero. The flat draw factor (1.05 post-extremization) captures the bulk of the effect. Splitting by matchday adds three free parameters for negligible lift.
Verdict: Below the significance threshold. No ship.
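One plausible reading of a draw factor, sketched under the assumption that it multiplies the draw probability and the three outcomes are then renormalised. The per-matchday values are placeholders chosen only to show the shape of the rejected variant; they are not fitted numbers from the backtest.

```python
from typing import Tuple

FLAT_DRAW_FACTOR = 1.05  # the flat group-stage factor quoted above

# Placeholder values for the rejected variant: three free parameters instead of one.
HYPOTHETICAL_MATCHDAY_FACTORS = {1: 1.08, 2: 1.05, 3: 1.02}


def apply_draw_factor(
    probs: Tuple[float, float, float], factor: float
) -> Tuple[float, float, float]:
    """Scale the draw probability by `factor`, then renormalise to sum to 1."""
    p_home, p_draw, p_away = probs
    scaled_draw = p_draw * factor
    total = p_home + scaled_draw + p_away
    return (p_home / total, scaled_draw / total, p_away / total)


flat = apply_draw_factor((0.50, 0.25, 0.25), FLAT_DRAW_FACTOR)
md1 = apply_draw_factor((0.50, 0.25, 0.25), HYPOTHETICAL_MATCHDAY_FACTORS[1])
```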
7. Tournament-only composite-alpha refit
Hypothesis: The composite-differential offset (which adjusts expected goals based on player quality) uses alpha=0.05. Refitting alpha on tournament matches only might find a different optimum.
Result: Optimal alpha on the tournament subset was 0.02 (smaller — less aggressive). Lift: 0.000156 Brier. Below practical significance by an order of magnitude.
Verdict: The all-competition alpha is close enough. No ship.
8. Predicted-XI start probability for scorer model
Hypothesis: The anytime-scorer model uses a caps-based proxy for minutes. Replacing it with a predicted starting-XI probability (from Model 4) should better capture which players actually appear on the pitch.
Result: Discretizing the continuous start probability into four buckets (0–25%, 25–50%, 50–75%, 75–100%) lost ranking power compared to the caps proxy. Brier regressed by +0.01. The caps proxy, for all its crudeness, carries more granular information about playing time than a four-level categorical.
Verdict: Discretization destroyed signal. Continuous variant is a future opportunity. No ship.
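A small sketch of why bucketing hurts ranking: players with clearly different start probabilities collapse into the same bucket and become indistinguishable. The probabilities below are invented for illustration.

```python
from itertools import combinations
from typing import List, Sequence


def bucket(p: float, n_buckets: int = 4) -> int:
    """Map a probability in [0, 1] to one of n equal-width buckets (0-indexed)."""
    return min(int(p * n_buckets), n_buckets - 1)


def tied_pairs(values: Sequence) -> int:
    """Count pairs of players the feature can no longer rank against each other."""
    return sum(1 for a, b in combinations(values, 2) if a == b)


# Hypothetical start probabilities for five squad players.
start_probs: List[float] = [0.55, 0.62, 0.78, 0.91, 0.97]
buckets = [bucket(p) for p in start_probs]

print(buckets)                  # [2, 2, 3, 3, 3]
print(tied_pairs(start_probs))  # 0
print(tied_pairs(buckets))      # 4
```

Four of the ten pairs become ties after bucketing, which is the kind of lost ordering a continuous variant would avoid.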
9. StatsBomb GK PSxG (international, A1 gate)
Hypothesis: International goalkeeper post-shot expected goals (PSxG) data from StatsBomb should improve the GK offset model, which currently uses only club-level PSxG.
Result: The initial test showed an apparent +44 basis point improvement. On investigation, the gate was scoring uncalibrated probabilities — the "improvement" was an extremization artefact, not a feature-quality signal. When scored against calibrated probabilities at a matched nudge, the lift dropped to +2.7bp versus the existing +0.9bp. Not enough to justify the data dependency.
Verdict: Gate measurement error inflated the result. No ship.
10. StatsBomb set-piece conversion (B1 gate)
Hypothesis: Set-piece conversion rates (corners, free kicks) from StatsBomb's international match data should improve goal expectations for teams that are unusually good or bad at set pieces.
Result: Median Brier improved by -2.7 basis points, but ECE blew out by +1.2 percentage points. The conjunction gate (Brier must improve AND ECE must not regress) killed it. The set-piece signal improves discrimination but worsens calibration — a classic sign of a noisy covariate that moves predictions in the right direction on average but pushes them too far on individual matches.
Verdict: Fails the conjunction gate. No ship.
What shipped instead
Nine features survived the same gate and are in the production model:
| Feature | Brier lift | Note |
|---|---|---|
| GK offset (club PSxG) | -1.16bp | Starting-goalkeeper quality as a defence multiplier |
| Composite-differential | -2.29bp | Player-quality-adjusted expected goals |
| Cross-league strength | (calibration) | Big-5 league multiplier on per-90 stats |
| Scorer xS/xG decomposition | (calibration) | Shots × shrunk(xG/shot) replaces raw xG |
| Hierarchical calibrator | ECE 16.6%→10.7% | Per-tier Platt + isotonic |
| HP posterior (NUTS) | +0.0003 Brier | Matches MAP; enables credible intervals |
| Group-stage draw factor | -2.7% Brier | Calibrator bypass + draw scaling for groups |
| Extremization (d=1.15) | (calibration) | Confidence sharpening post-ensemble |
| Draw factor interaction | (calibration) | Post-extremization re-optimisation |
The aggregate lift is modest. The value of the ensemble is robustness across regimes, not any single breakthrough. (On the recent walk-forward the ensemble's tournament Brier reads 0.506, but that figure reuses a current rating snapshot that mildly leaks; the honest, fully leakage-free tournament Brier is 0.572 across 987 matches from 2014–2024 — see the calibration scoreboard.)
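For readers unfamiliar with extremization, one common form raises each probability to a power d and renormalises. The sketch below assumes that form with the d=1.15 from the table; the post does not say which transform the production model actually uses.

```python
from typing import List, Sequence


def extremize(probs: Sequence[float], d: float = 1.15) -> List[float]:
    """Power-transform extremization: d > 1 sharpens, d = 1 leaves probs unchanged."""
    powered = [p ** d for p in probs]
    total = sum(powered)
    return [p / total for p in powered]


# Hypothetical ensemble output for one match.
sharpened = extremize([0.5, 0.3, 0.2])
```

The favourite's probability rises and the others fall, which is also why a draw factor tuned before extremization needs re-optimising afterwards.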
What's next
Three open opportunities sit at the top of the backlog:
- Confirmed-XI override at kick-off. Set start_prob=1.0 from actual lineups when they're announced ~75 minutes before kick-off. This is the most promising scorer-model upgrade: it replaces a noisy proxy with ground truth.
- Ensemble weight refactor. The tournament-only backtest shows Elo matching or exceeding Dixon-Coles on Brier. The current uniform 1/3 split may underweight Elo for World Cup specifically. A simpler "upweight Elo" heuristic is worth testing; it is distinct from the per-tier approach that failed (#4 above) because it doesn't segment by tier.
- GK offset re-gating. The current gate scored uncalibrated probabilities (#9 taught us this). Re-running the gate on calibrated probabilities would give a cleaner read on whether the shipped 1.16bp lift is real.
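The confirmed-XI override could look something like the sketch below. Only the start_prob=1.0 rule for announced starters comes from the plan above; the function name, the data shapes, and the choice to set non-starters to 0.0 are assumptions.

```python
from typing import Dict, Optional, Set


def start_probability(
    player: str,
    predicted: Dict[str, float],
    confirmed_xi: Optional[Set[str]] = None,
) -> float:
    """Use the announced lineup when available, else fall back to the prediction."""
    if confirmed_xi is None:
        return predicted[player]  # lineups not yet announced
    # Assumption: non-starters get 0.0 here; substitute appearances would need
    # separate handling that this sketch does not attempt.
    return 1.0 if player in confirmed_xi else 0.0


# Hypothetical players and predicted start probabilities.
predicted = {"player_a": 0.80, "player_b": 0.35}
before = start_probability("player_a", predicted)
after = start_probability("player_a", predicted, confirmed_xi={"player_a"})
```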
All three have existing infrastructure and clear hypotheses. Results will be published in the same format as above: hypothesis, test, verdict.
The full corpus of research notes — including the shipped variants and more detailed write-ups of each rejection — lives at /research/notes/. The negative-results index is at /research/negative-results/.
All metrics in this post are from the project's own walk-forward backtests. They are for research and educational purposes only, not recommendations of any kind. Methodology: /docs/methodology/.
