This post documents the complete prediction system behind onthepitch's 2026 World Cup forecasts. It's intended as a permanent reference — one page with every model component, every parameter value, and every backtest metric.
Architecture overview
The system has three layers:
- Component models: three independent models that each produce a home/draw/away probability for any international fixture.
- Ensemble: combines the three components into a single probability, applies extremization and calibration.
- Bracket simulator: feeds the ensemble probabilities into a 50,000-iteration Monte Carlo simulation of the full 48-team tournament structure.
Component model 1 — Elo
The simplest model. Each team carries an Elo rating from eloratings.net, updated after every international match.
Match prediction formula:
- Rating gap: d = elo_home − elo_away + home_advantage
- Expected score: E_home = 1 / (1 + 10^(−d/400))
- Draw probability: fixed at 22% (international football empirical baseline)
- Home/away probabilities: E_home and 1 − E_home, scaled to fill the remaining 78%
Key parameters:
| Parameter | Value | Source |
|---|---|---|
| Draw probability | 0.22 | International football base rate |
| Home advantage bonus | 100 Elo points | eloratings.net standard |
| Fallback Elo | 1500 | Teams missing from the CSV |
The Elo model is deliberately simple. It has no team-specific attack/defence split, no time decay beyond what eloratings.net applies internally, and no match-importance weighting. It serves as the ensemble's anchor — the component least likely to overfit.
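The prediction formula above can be sketched in a few lines. This is a minimal illustration rather than the production code: the function name `elo_probs` and the example ratings are hypothetical, while the 100-point home bonus, the 22% draw share, and the 1500 fallback come from the parameter table.

```python
def elo_probs(elo_home, elo_away, home_advantage=100.0, p_draw=0.22):
    """Return (home, draw, away) probabilities from two Elo ratings."""
    d = elo_home - elo_away + home_advantage          # rating gap
    e_home = 1.0 / (1.0 + 10.0 ** (-d / 400.0))       # Elo expected score
    p_home = (1.0 - p_draw) * e_home                  # scale into the remaining 78%
    p_away = (1.0 - p_draw) * (1.0 - e_home)
    return p_home, p_draw, p_away

FALLBACK_ELO = 1500.0  # used when a team is missing from the ratings CSV

# Hypothetical ratings, for illustration only.
probs = elo_probs(1900.0, FALLBACK_ELO)
```

With equal ratings and `home_advantage=0`, the function returns 0.39 / 0.22 / 0.39, which is the fixed draw share plus an even split of the rest.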
Component model 2 — Dixon-Coles with time decay
A bivariate Poisson model following Dixon & Coles (1997), extended with exponential time decay.
Each team has an attack strength (α) and a defence strength (β). The expected goals for a fixture are:
λ_home = exp(home_advantage + α_home − β_away)
λ_away = exp(α_away − β_home)
Goals are Poisson-distributed. The Dixon-Coles correction (ρ) adjusts the joint probability of low-scoring outcomes (0–0, 0–1, 1–0, 1–1) to account for the empirical dependence between home and away goals.
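As a rough sketch of how the two rates and ρ turn into home/draw/away probabilities: the code below builds the truncated score grid, applies the low-score multipliers from Dixon & Coles (1997), and sums the cells. The function names, the example strengths, and the renormalisation after truncation are assumptions for illustration, not the production implementation.

```python
import math

def poisson_pmf(k, lam):
    """Poisson probability of exactly k goals at rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def dc_tau(x, y, lam_home, lam_away, rho):
    """Dixon-Coles multiplier for the four low-scoring cells; 1.0 elsewhere."""
    if (x, y) == (0, 0):
        return 1.0 - lam_home * lam_away * rho
    if (x, y) == (0, 1):
        return 1.0 + lam_home * rho
    if (x, y) == (1, 0):
        return 1.0 + lam_away * rho
    if (x, y) == (1, 1):
        return 1.0 - rho
    return 1.0

def dc_outcome_probs(lam_home, lam_away, rho, max_goals=10):
    """Sum the corrected score grid into (home, draw, away) probabilities."""
    p_home = p_draw = p_away = 0.0
    for x in range(max_goals + 1):          # x = home goals
        for y in range(max_goals + 1):      # y = away goals
            p = (dc_tau(x, y, lam_home, lam_away, rho)
                 * poisson_pmf(x, lam_home) * poisson_pmf(y, lam_away))
            if x > y:
                p_home += p
            elif x == y:
                p_draw += p
            else:
                p_away += p
    total = p_home + p_draw + p_away        # assumed: renormalise after truncation
    return p_home / total, p_draw / total, p_away / total

# Made-up strengths, not fitted values.
lam_home = math.exp(0.3 + 0.2 - 0.1)   # exp(home_advantage + attack_home - defence_away)
lam_away = math.exp(0.1 - 0.0)         # exp(attack_away - defence_home)
probs = dc_outcome_probs(lam_home, lam_away, rho=-0.05)
```

A negative ρ shifts mass toward 0–0 and 1–1 and away from 0–1 and 1–0, so it raises the draw probability relative to independent Poisson goals.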
Key parameters:
| Parameter | Value | Notes |
|---|---|---|
| Half-life | 1,825 days (5 years) | Longer than club football (~18 months) because international matches are ~5× sparser per team |
| Training window | 10 years | Matches older than this are excluded |
| Min matches per team | 20 | Teams with fewer matches are dropped from the fit |
| Max goals cap | 10 | Poisson PMF truncated here |
| K-factor weighting | Yes | Tournament and qualifier matches weighted higher than friendlies |
Fit procedure: Maximum likelihood estimation via L-BFGS-B with a sum-to-zero identifiability constraint on attack parameters. Training corpus: ~9,200 international results (within the 10-year window). The fit runs on data strictly before the prediction date — no look-ahead.
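The time-decay and window rules can be illustrated with a small weight function. This sketch covers only the exponential decay and the cut-offs; the match-importance multipliers are not reproduced because their values are not listed here. The name `match_weight` and the 3,650-day approximation of the 10-year window are assumptions.

```python
from datetime import date, timedelta

HALF_LIFE_DAYS = 1825      # 5-year half-life from the parameter table
WINDOW_DAYS = 3650         # assumed day count for the 10-year window (leap days ignored)

def match_weight(match_date, prediction_date):
    """Exponential time-decay weight for one historical match."""
    age_days = (prediction_date - match_date).days
    if age_days <= 0:              # no look-ahead: only matches strictly before the date
        return 0.0
    if age_days > WINDOW_DAYS:     # outside the training window
        return 0.0
    return 0.5 ** (age_days / HALF_LIFE_DAYS)

# Hypothetical prediction date, for illustration only.
today = date(2026, 6, 11)
w_recent = match_weight(today - timedelta(days=30), today)
w_half_life = match_weight(today - timedelta(days=HALF_LIFE_DAYS), today)
```

A match exactly one half-life old gets weight 0.5, and a match at the edge of the window gets 0.25.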
Component model 3 — Hierarchical Poisson (MAP)
Same goal-scoring likelihood as Dixon-Coles, but with Gaussian priors that partially pool team strengths toward a global mean. This is the Baio & Blangiardo (2010) specification.
Priors:
| Parameter | Prior |
|---|---|
| Attack (α_i) | N(0, 0.5²) |
| Defence (β_i) | N(0, 0.5²) |
| Home advantage | N(0.3, 0.2²) |
| ρ (DC correction) | N(0, 0.1²) |
Key differences from Dixon-Coles:
- The Gaussian prior means teams with fewer matches are shrunk toward the global mean rather than being dropped. This lets the model estimate strengths for teams with as few as 10 matches (vs DC's minimum of 20).
- No sum-to-zero constraint needed — the prior identifies the model.
- Production uses MAP (maximum a posteriori) point estimates. A full NUTS posterior via PyMC is available as a parallel path for credible intervals but is not in the ensemble predict loop.
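To make the MAP idea concrete: a MAP fit minimises the Poisson negative log-likelihood plus a penalty from the priors. The sketch below shows only the penalty term for the priors in the table above, with additive constants dropped. The function name and argument layout are hypothetical.

```python
def neg_log_prior(attack, defence, home_advantage, rho):
    """Gaussian prior penalty (constants dropped) added to the NLL in a MAP fit."""
    def penalty(value, mean, sd):
        # Negative log-density of N(mean, sd^2), up to an additive constant.
        return 0.5 * ((value - mean) / sd) ** 2

    total = sum(penalty(a, 0.0, 0.5) for a in attack)     # attack ~ N(0, 0.5^2)
    total += sum(penalty(b, 0.0, 0.5) for b in defence)   # defence ~ N(0, 0.5^2)
    total += penalty(home_advantage, 0.3, 0.2)            # home advantage ~ N(0.3, 0.2^2)
    total += penalty(rho, 0.0, 0.1)                       # rho ~ N(0, 0.1^2)
    return total

# At the prior means the penalty is zero; it grows as parameters move away.
at_mean = neg_log_prior([0.0, 0.0], [0.0, 0.0], 0.3, 0.0)
one_sd_off = neg_log_prior([0.5, 0.0], [0.0, 0.0], 0.3, 0.0)
```

Because a team with few matches contributes little likelihood, the penalty dominates for that team and pulls its strengths toward zero, which is the shrinkage described above.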
The ensemble
The three component models are combined into a single probability per fixture.
Combination method: Uniform average. Each component gets equal weight (1/3 each). If a component cannot produce a prediction for a fixture (e.g., the team is missing from DC's training set), it is dropped and the remaining components split the weight equally.
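A minimal sketch of that pooling rule, assuming each component returns either a (home, draw, away) tuple or `None` when it cannot predict the fixture. The function name and the example numbers are hypothetical.

```python
def pool_uniform(component_probs):
    """Equal-weight average over the components that produced a prediction."""
    available = [p for p in component_probs if p is not None]
    if not available:
        raise ValueError("no component produced a prediction")
    n = len(available)
    return tuple(sum(p[k] for p in available) / n for k in range(3))

# Made-up component outputs; the middle component is missing for this fixture.
pooled = pool_uniform([(0.5, 0.3, 0.2), None, (0.3, 0.3, 0.4)])
```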
Why uniform weights? Bayesian stacking weights were fitted (Yao et al. 2018, LOO log-score optimisation). The result: DC 63%, Elo 32%, HP 6% (rounded; the figures sum to 101%). But the stacking ensemble did not beat uniform averaging on the walk-forward gate (median Brier 0.4954 vs 0.4941 for uniform). The per-walk weights were also unstable (Elo weight ranged 0.24–0.51 across walks), which is consistent with a flat optimum. Uniform averaging ships because it's more robust.
Extremization (d = 1.15)
After pooling, probabilities are pushed away from the uniform prior (1/3) in log-odds space:
lo_k = log(p_k / prior) for each class k ∈ {H, D, A}
elo_k = d × lo_k
p_k_new ∝ prior × exp(elo_k)
This corrects for the underconfidence inherent in linear probability pools (Ranjan & Gneiting, 2010): a linear average of individually calibrated forecasters is itself underconfident, and extremization is the standard fix.
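The three-step transform can be written directly from the formulas. This sketch assumes all input probabilities are strictly positive; the function name and example input are hypothetical.

```python
import math

def extremize(probs, d=1.15, prior=1.0 / 3.0):
    """Push pooled probabilities away from the prior in log-odds space."""
    log_odds = [math.log(p / prior) for p in probs]       # lo_k
    unnorm = [prior * math.exp(d * lo) for lo in log_odds]  # prior * exp(d * lo_k)
    z = sum(unnorm)
    return [u / z for u in unnorm]                        # normalise to sum to 1

# Made-up pooled forecast.
sharpened = extremize([0.5, 0.3, 0.2])
```

With a uniform prior this reduces to raising each probability to the power d and renormalising, so d = 1 leaves the forecast unchanged and a uniform forecast stays uniform for any d.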
Backtest validation: 8-fold walk-forward, tournament matches only (n=311):
- d=1.00 (no extremization): Brier 0.5084
- d=1.15: Brier 0.5063 (−21 basis points)
- Improvement is monotonic from d=1.00 to d=1.30; d=1.15 is a conservative choice from the literature.
Per-tier calibration
After extremization, the ensemble applies a calibration adjustment that depends on the match type:
| Tier | Method | n (training) | Platt temperature |
|---|---|---|---|
| Friendly | Isotonic (PAV) | 411 | T = 0.863 |
| Qualifier | Isotonic (PAV) | 457 | T = 0.917 |
| Tournament | Platt scaling | 70 | T = 1.157 |
The tournament tier uses Platt temperature scaling (a single parameter) rather than isotonic calibration because n=70 tournament matches is too few for stable isotonic curves. A preference override triggers automatically when the tournament tier has < 200 training matches.
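The selection rule and a temperature adjustment might look like the sketch below. Two things are assumptions here: the exact functional form of the temperature step (log-probabilities divided by T, then renormalised, which is the usual temperature-scaling convention) and the function names. Isotonic calibration is not shown.

```python
import math

def calibration_method(tier, n_training, min_isotonic_n=200):
    """Pick the calibrator; small tournament samples fall back to Platt scaling."""
    if tier == "tournament" and n_training < min_isotonic_n:
        return "platt"
    return "isotonic"

def temperature_scale(probs, temperature):
    """Assumed form: divide log-probabilities by T, then renormalise."""
    scaled = [math.exp(math.log(p) / temperature) for p in probs]
    z = sum(scaled)
    return [s / z for s in scaled]

method = calibration_method("tournament", 70)
# Made-up forecast; under this convention T > 1 softens it and T < 1 sharpens it.
softened = temperature_scale([0.6, 0.25, 0.15], 1.157)
```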
Calibration ECE (5-fold cross-validated):
- Uncalibrated: 4.12pp
- After per-tier calibration: 3.49pp (isotonic), 3.73pp (Platt)
- Tournament tier specifically: 11.71pp uncalibrated → 9.38pp (isotonic) / 10.71pp (Platt)
Goalkeeper defence offset
The starting goalkeeper's quality adjusts the expected goals conceded:
λ_home_adj = λ_home × exp(−α × centred_rating[away_gk])
λ_away_adj = λ_away × exp(−α × centred_rating[home_gk])
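Ratings are centred on the mean across all 48 WC teams (zero-mean offset). Here α is the offset scale, not a team attack strength α_i. α = 0.05 was selected as the grid-search winner across {0.0, 0.001, 0.005, 0.01, 0.02, 0.05}; the improvement was monotone in α, so the selected value is the largest one tested.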
Gate result: passed. 8×90d walk-forward: median Brier 0.49398 vs 0.49409 baseline (+1.16 basis points).
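A short sketch of the offset, under stated assumptions: the rating scale, the goalkeeper names, and the two-keeper example are made up (production centres across all 48 teams), and the helper names are hypothetical. Only the formula and the scale of 0.05 come from the text above.

```python
import math

GK_ALPHA = 0.05  # offset scale from the grid search

def centre(ratings):
    """Subtract the mean so the offsets average to zero."""
    mean = sum(ratings.values()) / len(ratings)
    return {name: value - mean for name, value in ratings.items()}

def apply_gk_offset(lam_home, lam_away, home_gk, away_gk, centred):
    """Scale each side's expected goals by the opposing goalkeeper's rating."""
    lam_home_adj = lam_home * math.exp(-GK_ALPHA * centred[away_gk])
    lam_away_adj = lam_away * math.exp(-GK_ALPHA * centred[home_gk])
    return lam_home_adj, lam_away_adj

# Hypothetical ratings on an arbitrary scale.
centred = centre({"keeper_a": 80.0, "keeper_b": 70.0})
adj_home, adj_away = apply_gk_offset(1.5, 1.2, "keeper_a", "keeper_b", centred)
```

An above-average goalkeeper lowers the opponent's expected goals and a below-average one raises them; an exactly average goalkeeper leaves the rate unchanged.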
Monte Carlo bracket simulator
The ensemble probabilities feed into a full simulation of the 2026 FIFA World Cup bracket.
Key parameters:
| Parameter | Value |
|---|---|
| Simulations | 50,000 |
| Goal baseline | 2.5 goals/match |
| Elo–goal scale | 0.0015 per Elo point |
| Third-place advance | 8 of 12 |
| Third-place combinations | 495 |
| PK resolution | Markov model (Model 15) when available; 50/50 fallback |
Group-stage draw factor: The raw ensemble draw probability is multiplied by 1.05 before group-stage simulations, then renormalised. This corrects for the empirical observation that group-stage matches at World Cups produce slightly more draws than the model's base rate predicts. The factor was selected by Brier-minimising sweep on 692 historical WC group-stage matches.
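The draw adjustment is a multiply-and-renormalise step, sketched here with a hypothetical function name and a made-up input forecast.

```python
def apply_draw_factor(p_home, p_draw, p_away, factor=1.05):
    """Inflate the draw probability, then renormalise so the three sum to 1."""
    boosted = p_draw * factor
    z = p_home + boosted + p_away
    return p_home / z, boosted / z, p_away / z

adjusted = apply_draw_factor(0.50, 0.26, 0.24)
```

Renormalising keeps the home-to-away ratio unchanged; only the draw share moves.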
Confidence intervals: 50 bootstrap snapshots × 5,000 sims each. The bootstrap resamples the training data, refits all component models, and re-runs the bracket. The published CI is the 5th–95th percentile across bootstrap snapshots.
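The final percentile step can be sketched with the standard library. The refit and bracket re-run are not shown; the input is assumed to be one value per bootstrap snapshot (for example, a team's title probability), and the function name and the synthetic data are hypothetical.

```python
import statistics

def percentile_interval(snapshot_values):
    """Return the (5th, 95th) percentiles across bootstrap snapshots."""
    # n=20 yields cut points at 5%, 10%, ..., 95%; take the first and last.
    cuts = statistics.quantiles(snapshot_values, n=20, method="inclusive")
    return cuts[0], cuts[-1]

# Synthetic stand-in for 50 per-snapshot title probabilities.
snapshots = [0.10 + 0.001 * i for i in range(50)]
low, high = percentile_interval(snapshots)
```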
Backtest performance
8-fold walk-forward, 90-day windows, tournament matches only (n=311 common-subset matches across 7 valid folds):
| Model | Brier (mean) | Brier (weighted) | Log-loss (weighted) |
|---|---|---|---|
| Elo | 0.562 | 0.513 | 0.972 |
| Dixon-Coles | 0.569 | 0.513 | 0.876 |
| Hierarchical Poisson | 0.571 | 0.516 | 0.880 |
| Ensemble | 0.557 | 0.506 | 0.864 |
The ensemble beats every component on both Brier and log-loss. The margin is modest — the value is in robustness across regimes, not in any single fold.
Uniform 3-class baseline: Brier = 0.667. The ensemble's weighted Brier of 0.506 is a 24% improvement over that uniform-guess baseline.
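Both figures can be checked with a few lines, assuming the multi-class Brier score is the sum of squared errors over the three outcome classes (the convention under which a uniform forecast scores 0.667). The function name is hypothetical.

```python
def brier(probs, outcome_index):
    """Multi-class Brier score for one match: squared error summed over classes."""
    return sum(
        (p - (1.0 if k == outcome_index else 0.0)) ** 2
        for k, p in enumerate(probs)
    )

# A uniform forecast scores (2/3)^2 + 2 * (1/3)^2 = 2/3 whatever the result.
uniform_brier = brier((1 / 3, 1 / 3, 1 / 3), 0)

# Relative improvement of the ensemble's weighted Brier over that baseline.
improvement = 1.0 - 0.506 / uniform_brier
```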
What's not in the model
Several features were tested and rejected:
- Set-piece-aware Dixon-Coles: Brier +0.0079 worse. Not shipped.
- Style-matchup pair effects: Did not clear the walk-forward gate. Not shipped.
- Confederation-pooled Hierarchical Poisson: Overall Brier worsened by 57bp despite per-tier improvements. Not shipped.
- Bayesian stacking weights: Did not beat uniform averaging on median Brier. Not shipped.
- HistGradientBoosting meta-learner: Median Brier 0.534 vs uniform 0.498 — worse on every walk. Not shipped.
Negative results stay on the record. They constrain the design space and prevent re-running experiments that have already been run.
Model version history
| # | Component | Status |
|---|---|---|
| 1 | Elo + MC bracket simulator | Shipping |
| 2 | Dixon-Coles MLE with time decay | Shipping |
| 3 | Hierarchical Poisson MAP | Shipping |
| 3b | Confederation-pooled HP | Tested, not shipped |
| 4 | Player composite rating (FBref per-90 + TM valuation) | Data pipeline |
| 4b | Goalkeeper-specific PSxG rating | Data pipeline |
| 5 | Tournament-scorer probability | Shipping (anytime scorer) |
| 14 | Set-piece-aware DC | Tested, not shipped |
| 15 | PK proficiency Markov model | Shipping |
| 16 | Player-composite-differential offset (α=0.05) | Shipping |
| 17 | Style-matchup pair effects | Tested, not shipped |
| 18 | Starting-GK defence offset (α=0.05) | Shipping |
Numbers in this post are pinned to the May 28 model state. The methodology docs at /docs/methodology/ carry the canonical version.
All numbers are model outputs. They are for research and educational purposes only: not betting advice, not financial advice, not recommendations to gamble.