28 May 2026 · OnThePitch Staff

Model notes — how onthepitch predicts the 2026 World Cup

A complete reference for the prediction system behind onthepitch. Three component models (Elo, Dixon-Coles, Hierarchical Poisson), a calibrated ensemble, and a Monte Carlo bracket simulator. Every parameter, every design choice, every backtest number — in one place.

This post documents the complete prediction system behind onthepitch's 2026 World Cup forecasts. It's intended as a permanent reference — one page with every model component, every parameter value, and every backtest metric.

Architecture overview

The system has three layers:

  1. Component models: three independent models that each produce home/draw/away probabilities for any international fixture.
  2. Ensemble: combines the three components into a single probability, applies extremization and calibration.
  3. Bracket simulator: feeds the ensemble probabilities into a 50,000-iteration Monte Carlo simulation of the full 48-team tournament structure.

Component model 1 — Elo

The simplest model. Each team carries an Elo rating from eloratings.net, updated after every international match.

Match prediction formula:

  • Rating gap: d = elo_home − elo_away + home_advantage
  • Expected score: E_home = 1 / (1 + 10^(−d/400))
  • Draw probability: fixed at 22% (international football empirical baseline)
  • Home/away probabilities: E_home and 1 − E_home scaled to fill the remaining 78%

Key parameters:

| Parameter | Value | Source |
|---|---|---|
| Draw probability | 0.22 | International football base rate |
| Home advantage bonus | 100 Elo points | eloratings.net standard |
| Fallback Elo | 1500 | Teams missing from the CSV |
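As a minimal sketch of the formula and parameters above (the function name `elo_match_probs` and its argument names are illustrative, not the production code):

```python
FALLBACK_ELO = 1500.0  # used when a team is missing from the ratings CSV


def elo_match_probs(elo_home, elo_away, home_advantage=100.0, draw_prob=0.22):
    """Return (home, draw, away) probabilities from two Elo ratings."""
    d = elo_home - elo_away + home_advantage
    expected_home = 1.0 / (1.0 + 10.0 ** (-d / 400.0))
    remaining = 1.0 - draw_prob  # the 78% shared between home and away
    return (expected_home * remaining, draw_prob, (1.0 - expected_home) * remaining)


# Hypothetical ratings, for illustration only.
home, draw, away = elo_match_probs(1850.0, FALLBACK_ELO)
```

Because the draw share is fixed, the three outputs always sum to 1 and only the home/away split moves with the rating gap.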

The Elo model is deliberately simple. It has no team-specific attack/defence split, no time decay beyond what eloratings.net applies internally, and no match-importance weighting. It serves as the ensemble's anchor — the component least likely to overfit.

Component model 2 — Dixon-Coles with time decay

A joint model of home and away goals following Dixon & Coles (1997): two Poisson goal counts with a dependence correction for low scores, fitted with exponential time decay.

Each team has an attack strength (α) and a defence strength (β). The expected goals for a fixture are:

  • λ_home = exp(home_advantage + α_home − β_away)
  • λ_away = exp(α_away − β_home)

Goals are Poisson-distributed. The Dixon-Coles correction (ρ) adjusts the joint probability of low-scoring outcomes (0–0, 0–1, 1–0, 1–1) to account for the empirical dependence between home and away goals.
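The sketch below turns the two rate formulas and the low-score correction into home/draw/away probabilities. The form of the correction factor is the one given in the Dixon & Coles (1997) paper and is assumed here to match production; all function and argument names are illustrative. The score grid stops at the 10-goal cap listed in the key parameters and is renormalised after truncation.

```python
import math


def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)


def dc_tau(x, y, lam_home, lam_away, rho):
    """Low-score correction factor from Dixon & Coles (1997)."""
    if x == 0 and y == 0:
        return 1.0 - lam_home * lam_away * rho
    if x == 0 and y == 1:
        return 1.0 + lam_home * rho
    if x == 1 and y == 0:
        return 1.0 + lam_away * rho
    if x == 1 and y == 1:
        return 1.0 - rho
    return 1.0


def dc_outcome_probs(attack_home, defence_home, attack_away, defence_away,
                     home_advantage, rho, max_goals=10):
    """Return (home, draw, away) probabilities from team strengths."""
    lam_home = math.exp(home_advantage + attack_home - defence_away)
    lam_away = math.exp(attack_away - defence_home)
    home = draw = away = 0.0
    for x in range(max_goals + 1):        # home goals
        for y in range(max_goals + 1):    # away goals
            p = (dc_tau(x, y, lam_home, lam_away, rho)
                 * poisson_pmf(x, lam_home) * poisson_pmf(y, lam_away))
            if x > y:
                home += p
            elif x == y:
                draw += p
            else:
                away += p
    total = home + draw + away            # renormalise after truncating the grid
    return home / total, draw / total, away / total
```

With a negative ρ, the 0–0 and 1–1 cells gain mass at the expense of 0–1 and 1–0, so the draw probability rises relative to independent Poisson goals.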

Key parameters:

| Parameter | Value | Notes |
|---|---|---|
| Half-life | 1,825 days (5 years) | Longer than club football (~18 months) because international matches are ~5× sparser per team |
| Training window | 10 years | Matches older than this are excluded |
| Min matches per team | 20 | Teams with fewer matches are dropped from the fit |
| Max goals cap | 10 | Poisson PMF truncated here |
| K-factor weighting | Yes | Tournament and qualifier matches weighted higher than friendlies |

Fit procedure: Maximum likelihood estimation via L-BFGS-B with a sum-to-zero identifiability constraint on attack parameters. Training corpus: ~9,200 international results (within the 10-year window). The fit runs on data strictly before the prediction date — no look-ahead.
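A sketch of the per-match time weight implied by the half-life and training window. The conversion of "10 years" to 3,650 days and the function name are assumptions, and the match-type (K-factor) multiplier is left out because its values are not given here.

```python
HALF_LIFE_DAYS = 1825.0  # 5 years
WINDOW_DAYS = 3650.0     # assumption: "10 years" taken as 10 x 365 days


def decay_weight(age_days, half_life_days=HALF_LIFE_DAYS, window_days=WINDOW_DAYS):
    """Likelihood weight for a match played `age_days` before the prediction date."""
    if age_days <= 0:
        # The fit only sees matches strictly before the prediction date.
        raise ValueError("match must be strictly before the prediction date")
    if age_days > window_days:
        return 0.0  # outside the training window
    return 0.5 ** (age_days / half_life_days)
```

A match exactly one half-life old counts half as much as one played yesterday; a match at the edge of the window counts a quarter as much.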

Component model 3 — Hierarchical Poisson (MAP)

Same goal-scoring likelihood as Dixon-Coles, but with Gaussian priors that partially pool team strengths toward a global mean. This is the Baio & Blangiardo (2010) specification.

Priors:

| Parameter | Prior |
|---|---|
| Attack (α_i) | N(0, 0.5²) |
| Defence (β_i) | N(0, 0.5²) |
| Home advantage | N(0.3, 0.2²) |
| ρ (DC correction) | N(0, 0.1²) |

Key differences from Dixon-Coles:

  • The Gaussian prior means teams with fewer matches are shrunk toward the global mean rather than being dropped. This lets the model estimate strengths for teams with as few as 10 matches (vs DC's minimum of 20).
  • No sum-to-zero constraint needed — the prior identifies the model.
  • Production uses MAP (maximum a posteriori) point estimates. A full NUTS posterior via PyMC is available as a parallel path for credible intervals but is not in the ensemble predict loop.
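One way to read the MAP setup: the objective is the Dixon-Coles negative log-likelihood plus a quadratic penalty from each Gaussian prior. The sketch below covers only the penalty term, using the prior means and standard deviations from the table; the function names and the dict-based parameter layout are assumptions, not the production code.

```python
def gaussian_penalty(value, mean, sd):
    """Negative log-density of N(mean, sd^2), dropping the additive constant."""
    return 0.5 * ((value - mean) / sd) ** 2


def neg_log_prior(attack, defence, home_advantage, rho):
    """Sum of prior penalties; `attack` and `defence` map team name -> strength."""
    penalty = sum(gaussian_penalty(a, 0.0, 0.5) for a in attack.values())
    penalty += sum(gaussian_penalty(b, 0.0, 0.5) for b in defence.values())
    penalty += gaussian_penalty(home_advantage, 0.3, 0.2)
    penalty += gaussian_penalty(rho, 0.0, 0.1)
    return penalty


# MAP objective (sketch): weighted_neg_log_likelihood(params) + neg_log_prior(...)
```

A team with few matches contributes little to the likelihood, so the penalty dominates and pulls its strengths toward zero, which is the shrinkage described above.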

The ensemble

The three component models are combined into a single probability per fixture.

Combination method: Uniform average. Each component gets equal weight (1/3 each). If a component cannot produce a prediction for a fixture (e.g., the team is missing from DC's training set), it is dropped and the remaining components split the weight equally.
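A minimal sketch of the pooling rule, assuming a missing component is represented as `None` (the representation and function name are illustrative):

```python
def uniform_pool(component_probs):
    """Average (home, draw, away) tuples, skipping components that returned None."""
    available = [p for p in component_probs if p is not None]
    if not available:
        raise ValueError("no component produced a prediction")
    n = len(available)
    return tuple(sum(p[k] for p in available) / n for k in range(3))


# Hypothetical inputs: Elo and HP predict, DC has no rating for one of the teams.
pooled = uniform_pool([(0.6, 0.2, 0.2), None, (0.4, 0.3, 0.3)])
```

With one component missing, the remaining two each carry a weight of 1/2.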

Why uniform weights? Bayesian stacking weights were fitted (Yao et al. 2018, LOO log-score optimisation). The result: DC 63%, Elo 32%, HP 6% (rounded, so the shares total 101%). But the stacking ensemble did not beat uniform averaging on the walk-forward gate (median Brier 0.4954 vs 0.4941 for uniform; lower is better). The per-walk weights were also unstable (Elo weight ranged 0.24–0.51 across walks), which is consistent with a flat optimum. Uniform averaging ships because it's more robust.

Extremization (d = 1.15)

After pooling, probabilities are pushed away from the uniform prior (1/3) in log-odds space:

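lo_k  = log(p_k / prior)       for each class k ∈ {H, D, A}, with prior = 1/3
ext_k = d × lo_k               (extremized log-odds; unrelated to Elo ratings)
p_k_new ∝ prior × exp(ext_k)   (renormalised so the three classes sum to 1)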

This corrects for the underconfidence inherent in linear probability pools (Ranjan & Gneiting, 2010). A linear average of distinct calibrated forecasts is itself necessarily underconfident, and extremization is the standard fix.
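The three-line transform can be sketched as follows. The function name is illustrative, and the sketch assumes every input probability is strictly positive (a zero would need clipping before the log).

```python
import math


def extremize(probs, d=1.15, prior=1.0 / 3.0):
    """Push (home, draw, away) probabilities away from the uniform prior."""
    scaled = [prior * math.exp(d * math.log(p / prior)) for p in probs]
    total = sum(scaled)
    return tuple(s / total for s in scaled)


# Hypothetical pooled forecast, for illustration only.
sharpened = extremize((0.50, 0.30, 0.20))
```

With a uniform prior this is equivalent to raising each probability to the power d and renormalising, so d = 1 leaves the forecast unchanged and d > 1 increases the favourite's share.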

Backtest validation: 8-fold walk-forward, tournament matches only (n=311):

  • d=1.00 (no extremization): Brier 0.5084
  • d=1.15: Brier 0.5063 (−21 basis points)
  • Improvement is monotonic from d=1.00 to d=1.30; d=1.15 is a conservative choice from the literature.

Per-tier calibration

After extremization, the ensemble applies a calibration adjustment that depends on the match type:

| Tier | Method | n (training) | Platt temperature |
|---|---|---|---|
| Friendly | Isotonic (PAV) | 411 | T = 0.863 |
| Qualifier | Isotonic (PAV) | 457 | T = 0.917 |
| Tournament | Platt scaling | 70 | T = 1.157 |

The tournament tier uses Platt temperature scaling (a single parameter) rather than isotonic calibration because n=70 tournament matches is too few for stable isotonic curves. A preference override triggers automatically when the tournament tier has < 200 training matches.
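The exact functional form of the temperature step is not spelled out here. A common convention, assumed in this sketch, divides the log-probabilities by T and renormalises, so T > 1 softens a forecast and T < 1 sharpens it. The second function mirrors the override rule; both names and the `tier` strings are hypothetical.

```python
import math


def temperature_scale(probs, temperature):
    """Assumed form: softmax of log-probabilities divided by the temperature."""
    logits = [math.log(p) / temperature for p in probs]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(z - peak) for z in logits]
    total = sum(weights)
    return tuple(w / total for w in weights)


def pick_method(tier, n_train, min_isotonic_n=200):
    """Tournament tier falls back to Platt when it has too few training matches."""
    if tier == "tournament" and n_train < min_isotonic_n:
        return "platt"
    return "isotonic"
```

Under this assumed convention, the tournament value T = 1.157 would pull probabilities toward uniform, while the friendly and qualifier values below 1 would sharpen them.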

Calibration ECE (5-fold cross-validated):

  • Uncalibrated: 4.12pp
  • After per-tier calibration: 3.49pp (isotonic), 3.73pp (Platt)
  • Tournament tier specifically: 11.71pp uncalibrated → 9.38pp (isotonic) / 10.71pp (Platt)

Goalkeeper defence offset

The starting goalkeeper's quality adjusts the expected goals conceded:

λ_home_adj = λ_home × exp(−α × centred_rating[away_gk])
λ_away_adj = λ_away × exp(−α × centred_rating[home_gk])

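Ratings are centred on the mean across all 48 WC teams (zero-mean offset). Here α is the offset scale, unrelated to the attack strengths α_i in the component models. α = 0.05 was the grid-search winner across {0.0, 0.001, 0.005, 0.01, 0.02, 0.05}; the improvement was monotone in α, so the selected value sits at the upper edge of the grid.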
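A sketch of the offset under the two formulas above. The rating scale is not stated here, so the ratings in the test case are arbitrary; the function names and the dict keyed by goalkeeper are assumptions.

```python
import math


def centre_ratings(ratings):
    """Subtract the mean so the average goalkeeper has an offset of zero."""
    mean = sum(ratings.values()) / len(ratings)
    return {keeper: value - mean for keeper, value in ratings.items()}


def apply_gk_offset(lam_home, lam_away, home_gk, away_gk, centred, scale=0.05):
    """Scale each side's expected goals by the opposing keeper's centred rating."""
    lam_home_adj = lam_home * math.exp(-scale * centred[away_gk])
    lam_away_adj = lam_away * math.exp(-scale * centred[home_gk])
    return lam_home_adj, lam_away_adj
```

An above-average keeper lowers the opponent's expected goals, a below-average one raises them, and an exactly average keeper leaves them unchanged.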

Gate result: passed. 8×90d walk-forward: median Brier 0.49398 vs 0.49409 baseline, an improvement of 1.16 basis points (lower Brier is better).

Monte Carlo bracket simulator

The ensemble probabilities feed into a full simulation of the 2026 FIFA World Cup bracket.

Key parameters:

| Parameter | Value |
|---|---|
| Simulations | 50,000 |
| Goal baseline | 2.5 goals/match |
| Elo–goal scale | 0.0015 per Elo point |
| Third-place advance | 8 of 12 |
| Third-place combinations | 495 |
| PK resolution | Markov model (Model 15) when available; 50/50 fallback |
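The 495 figure is the number of ways to choose which 8 of the 12 groups supply an advancing third-placed team, C(12, 8). A quick check, labelling the groups A to L for illustration:

```python
import math
from itertools import combinations

GROUPS = "ABCDEFGHIJKL"  # 12 groups; the letters are illustrative labels

n_combinations = math.comb(len(GROUPS), 8)
all_combinations = list(combinations(GROUPS, 8))
```

Each combination determines a different set of round-of-32 pairings, which is why the simulator has to handle all of them.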

Group-stage draw factor: The raw ensemble draw probability is multiplied by 1.05 before group-stage simulations, then renormalised. This corrects for the empirical observation that group-stage matches at World Cups produce slightly more draws than the model's base rate predicts. The factor was selected by Brier-minimising sweep on 692 historical WC group-stage matches.
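The draw adjustment amounts to a multiply-and-renormalise step, sketched here with an illustrative function name:

```python
def apply_draw_factor(probs, factor=1.05):
    """Inflate the draw probability, then renormalise (home, draw, away)."""
    home, draw, away = probs
    draw *= factor
    total = home + draw + away
    return home / total, draw / total, away / total


# Hypothetical ensemble output, for illustration only.
adjusted = apply_draw_factor((0.40, 0.30, 0.30))
```

Because of the renormalisation, the draw probability rises by slightly less than 5% in relative terms, and the home and away probabilities shrink proportionally.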

Confidence intervals: 50 bootstrap snapshots × 5,000 sims each. The bootstrap resamples the training data, refits all component models, and re-runs the bracket. The published CI is the 5th–95th percentile across bootstrap snapshots.
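Once each bootstrap snapshot has produced a value (say, one team's title probability), the interval is a percentile read across snapshots. The sketch below assumes linear interpolation between order statistics, which is not specified here; the function names are illustrative.

```python
def percentile(sorted_values, q):
    """q-th percentile (0-100) with linear interpolation between order statistics."""
    n = len(sorted_values)
    pos = q * (n - 1) / 100.0
    lower = int(pos)               # pos is non-negative, so int() floors it
    upper = min(lower + 1, n - 1)
    frac = pos - lower
    return sorted_values[lower] * (1.0 - frac) + sorted_values[upper] * frac


def bootstrap_interval(snapshot_values, low_q=5, high_q=95):
    """Return the (5th, 95th) percentile interval across bootstrap snapshots."""
    ordered = sorted(snapshot_values)
    return percentile(ordered, low_q), percentile(ordered, high_q)
```

With 50 snapshots the 5th percentile falls between the third and fourth smallest values, so the interval endpoints rest on only a handful of refits.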

Backtest performance

8-fold walk-forward, 90-day windows, tournament matches only (n=311 common-subset matches across 7 valid folds):

| Model | Brier (mean) | Brier (weighted) | Log-loss (weighted) |
|---|---|---|---|
| Elo | 0.562 | 0.513 | 0.972 |
| Dixon-Coles | 0.569 | 0.513 | 0.876 |
| Hierarchical Poisson | 0.571 | 0.516 | 0.880 |
| Ensemble | 0.557 | 0.506 | 0.864 |

The ensemble beats every component on both Brier and log-loss. The margin is modest — the value is in robustness across regimes, not in any single fold.

Uniform 3-class baseline: Brier = 0.667. The ensemble's weighted Brier of 0.506 is a 24% improvement over that baseline: (0.667 − 0.506) / 0.667 ≈ 0.24.
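For reference, the baseline and the improvement figure follow from the multi-class Brier score, the sum of squared errors over the three outcomes. A short check (function name illustrative):

```python
def brier_score(probs, outcome_index):
    """Multi-class Brier score for one match; outcome_index is 0=H, 1=D, 2=A."""
    return sum(
        (p - (1.0 if k == outcome_index else 0.0)) ** 2
        for k, p in enumerate(probs)
    )


# A uniform forecast scores the same whatever the result: (2/3)^2 + 2 * (1/3)^2.
uniform_brier = brier_score((1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0), 0)

# Relative improvement of the ensemble's weighted Brier over the uniform baseline.
skill = 1.0 - 0.506 / uniform_brier
```

On this scale a perfect forecast scores 0 and a confident wrong forecast scores 2, so lower is better.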

What's not in the model

Several features were tested and rejected:

  • Set-piece-aware Dixon-Coles: Brier worse by 0.0079. Not shipped.
  • Style-matchup pair effects: Did not clear the walk-forward gate. Not shipped.
  • Confederation-pooled Hierarchical Poisson: Overall Brier regressed by 57bp despite per-tier improvements. Not shipped.
  • Bayesian stacking weights: Did not beat uniform averaging on median Brier. Not shipped.
  • HistGradientBoosting meta-learner: Median Brier 0.534 vs uniform 0.498 — worse on every walk. Not shipped.

Negative results stay on the record. They constrain the design space and prevent re-running experiments that have already been run.

Model version history

| # | Component | Status |
|---|---|---|
| 1 | Elo + MC bracket simulator | Shipping |
| 2 | Dixon-Coles MLE with time decay | Shipping |
| 3 | Hierarchical Poisson MAP | Shipping |
| 3b | Confederation-pooled HP | Tested, not shipped |
| 4 | Player composite rating (FBref per-90 + TM valuation) | Data pipeline |
| 4b | Goalkeeper-specific PSxG rating | Data pipeline |
| 5 | Tournament-scorer probability | Shipping (anytime scorer) |
| 14 | Set-piece-aware DC | Tested, not shipped |
| 15 | PK proficiency Markov model | Shipping |
| 16 | Player-composite-differential offset (α=0.05) | Shipping |
| 17 | Style-matchup pair effects | Tested, not shipped |
| 18 | Starting-GK defence offset (α=0.05) | Shipping |

Numbers in this post are pinned to the May 28 model state. The methodology docs at /docs/methodology/ carry the canonical version.


All numbers are model outputs. They are for research and educational purposes only — not betting advice, not financial advice, not recommendations to gamble. Methodology: /docs/methodology/. Full Terms of Use.

