Model notes — how onthepitch predicts the 2026 World Cup

This post documents the complete prediction system behind onthepitch's 2026 World Cup forecasts. It's intended as a permanent reference — one page with every model component, every parameter value, and every backtest metric.

Architecture overview

The system has three layers:

Component models: three independent models that each produce a home/draw/away probability for any international fixture.
Ensemble: combines the three components into a single probability, applies extremization and calibration.
Bracket simulator: feeds the ensemble probabilities into a 50,000-iteration Monte Carlo simulation of the full 48-team tournament structure.

Component model 1 — Elo

The simplest model. Each team carries an Elo rating from eloratings.net, updated after every international match.

Match prediction formula:

Rating gap: d = elo_home − elo_away + home_advantage
Expected score: E_home = 1 / (1 + 10^(−d/400))
Draw probability: fixed at 22% (international football empirical baseline)
Home/away probabilities: E_home and 1 − E_home scaled to fill the remaining 78%

Key parameters:

Parameter	Value	Source
Draw probability	0.22	International football base rate
Home advantage bonus	100 Elo points	eloratings.net standard
Fallback Elo	1500	Teams missing from the CSV
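
The formula and parameters above can be sketched in a few lines of Python. This is a minimal illustration, not the production code: the function name `elo_probs` is hypothetical, and passing `home_advantage=0` for a neutral-venue fixture is an assumption the post does not spell out.

```python
FALLBACK_ELO = 1500.0  # used for teams missing from the ratings CSV

def elo_probs(elo_home, elo_away, home_advantage=100.0, draw_prob=0.22):
    """Return (home, draw, away) probabilities from the Elo formula above."""
    d = elo_home - elo_away + home_advantage
    e_home = 1.0 / (1.0 + 10.0 ** (-d / 400.0))
    remaining = 1.0 - draw_prob  # the 78% shared between home and away
    return e_home * remaining, draw_prob, (1.0 - e_home) * remaining

# Hypothetical ratings, for illustration only.
p_home, p_draw, p_away = elo_probs(1900.0, 1800.0)
```

The draw share never moves with the rating gap, which is the main simplification this component makes.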

The Elo model is deliberately simple. It has no team-specific attack/defence split, no time decay beyond what eloratings.net applies internally, and no match-importance weighting. It serves as the ensemble's anchor — the component least likely to overfit.

Component model 2 — Dixon-Coles with time decay

A bivariate Poisson model following Dixon & Coles (1997), extended with exponential time decay.

Each team has an attack strength (α) and a defence strength (β). The expected goals for a fixture are:

λ_home = exp(home_advantage + α_home − β_away)
λ_away = exp(α_away − β_home)

Goals are Poisson-distributed. The Dixon-Coles correction (ρ) adjusts the joint probability of low-scoring outcomes (0–0, 0–1, 1–0, 1–1) to account for the empirical dependence between home and away goals.
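
As a rough sketch of how the expected goals and ρ turn into home/draw/away probabilities, the snippet below builds the truncated scoreline grid and applies the low-score correction in the form given by Dixon & Coles (1997). The function names and the example λ and ρ values are hypothetical, and renormalising after truncation is an assumption.

```python
import math

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)

def dc_tau(x, y, lam, mu, rho):
    """Dixon-Coles adjustment for the four low-scoring scorelines."""
    if x == 0 and y == 0:
        return 1.0 - lam * mu * rho
    if x == 0 and y == 1:
        return 1.0 + lam * rho
    if x == 1 and y == 0:
        return 1.0 + mu * rho
    if x == 1 and y == 1:
        return 1.0 - rho
    return 1.0

def dc_outcome_probs(lam_home, lam_away, rho, max_goals=10):
    """Sum the corrected scoreline grid into (home, draw, away)."""
    home = draw = away = 0.0
    for x in range(max_goals + 1):
        for y in range(max_goals + 1):
            p = (dc_tau(x, y, lam_home, lam_away, rho)
                 * poisson_pmf(x, lam_home) * poisson_pmf(y, lam_away))
            if x > y:
                home += p
            elif x == y:
                draw += p
            else:
                away += p
    total = home + draw + away  # renormalise the mass lost to truncation
    return home / total, draw / total, away / total

# Hypothetical expected goals and rho, for illustration only.
probs = dc_outcome_probs(1.4, 1.1, rho=-0.1)
```

With a negative ρ the correction moves probability onto 0–0 and 1–1, so the draw share rises relative to independent Poisson goals.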

Key parameters:

Parameter	Value	Notes
Half-life	1,825 days (5 years)	Longer than club football (~18 months) because international matches are ~5× sparser per team
Training window	10 years	Matches older than this are excluded
Min matches per team	20	Teams with fewer matches are dropped from the fit
Max goals cap	10	Poisson PMF truncated here
K-factor weighting	Yes	Tournament and qualifier matches weighted higher than friendlies

Fit procedure: Maximum likelihood estimation via L-BFGS-B with a sum-to-zero identifiability constraint on attack parameters. Training corpus: ~9,200 international results (within the 10-year window). The fit runs on data strictly before the prediction date — no look-ahead.
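
The time-decay and window rules can be illustrated with a small weighting function. This is a sketch under stated assumptions: the name `match_weight` is hypothetical, the 10-year window is approximated as 3,650 days, and `importance` stands in for the tournament/qualifier/friendly multiplier, whose actual values are not given in this post.

```python
from datetime import date, timedelta

HALF_LIFE_DAYS = 1825     # from the parameter table
WINDOW_DAYS = 10 * 365    # assumption: "10 years" without leap days

def match_weight(match_date, prediction_date, importance=1.0):
    """Likelihood weight for one match; 0 outside the training window."""
    age_days = (prediction_date - match_date).days
    # Only matches strictly before the prediction date count (no look-ahead).
    if age_days <= 0 or age_days > WINDOW_DAYS:
        return 0.0
    return importance * 0.5 ** (age_days / HALF_LIFE_DAYS)

prediction_date = date(2026, 6, 1)
w_recent = match_weight(prediction_date - timedelta(days=1), prediction_date)
w_half_life = match_weight(prediction_date - timedelta(days=HALF_LIFE_DAYS), prediction_date)
w_too_old = match_weight(prediction_date - timedelta(days=WINDOW_DAYS + 1), prediction_date)
```

A match exactly one half-life old counts half as much as one played yesterday; anything past the window contributes nothing.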

Component model 3 — Hierarchical Poisson (MAP)

Same goal-scoring likelihood as Dixon-Coles, but with Gaussian priors that partially pool team strengths toward a global mean. This is the Baio & Blangiardo (2010) specification.

Priors:

Parameter	Prior
Attack (α_i)	N(0, 0.5²)
Defence (β_i)	N(0, 0.5²)
Home advantage	N(0.3, 0.2²)
ρ (DC correction)	N(0, 0.1²)

Key differences from Dixon-Coles:

The Gaussian prior means teams with fewer matches are shrunk toward the global mean rather than being dropped. This lets the model estimate strengths for teams with as few as 10 matches (vs DC's minimum of 20).
No sum-to-zero constraint needed — the prior identifies the model.
Production uses MAP (maximum a posteriori) point estimates. A full NUTS posterior via PyMC is available as a parallel path for credible intervals but is not in the ensemble predict loop.
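
One way to picture the MAP fit is as the Dixon-Coles negative log-likelihood plus a quadratic penalty per parameter. The sketch below writes out only the penalty terms implied by the prior table (additive constants dropped); the function names are hypothetical and the likelihood is passed in as a plain number.

```python
# (mean, standard deviation) pairs taken from the prior table.
PRIORS = {
    "attack": (0.0, 0.5),
    "defence": (0.0, 0.5),
    "home_advantage": (0.3, 0.2),
    "rho": (0.0, 0.1),
}

def gaussian_penalty(x, mean, sd):
    """Negative log density of N(mean, sd^2) at x, up to a constant."""
    return 0.5 * ((x - mean) / sd) ** 2

def neg_log_prior(attack, defence, home_advantage, rho):
    """attack and defence map team name -> strength."""
    total = sum(gaussian_penalty(a, *PRIORS["attack"]) for a in attack.values())
    total += sum(gaussian_penalty(b, *PRIORS["defence"]) for b in defence.values())
    total += gaussian_penalty(home_advantage, *PRIORS["home_advantage"])
    total += gaussian_penalty(rho, *PRIORS["rho"])
    return total

def map_objective(neg_log_likelihood, attack, defence, home_advantage, rho):
    """Quantity a MAP optimiser would minimise."""
    return neg_log_likelihood + neg_log_prior(attack, defence, home_advantage, rho)
```

A team with few matches contributes little likelihood, so the penalty dominates and its strengths stay near zero. That is the shrinkage described above.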

The ensemble

The three component models are combined into a single probability per fixture.

Combination method: Uniform average. Each component gets equal weight (1/3 each). If a component cannot produce a prediction for a fixture (e.g., the team is missing from DC's training set), it is dropped and the remaining components split the weight equally.
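
A minimal sketch of the pooling rule, including the drop-out behaviour. The function name and the dictionary layout are hypothetical, and the example probabilities are made up for illustration.

```python
def pool_uniform(component_probs):
    """Average (home, draw, away) tuples, skipping components that returned None."""
    available = [p for p in component_probs.values() if p is not None]
    if not available:
        raise ValueError("no component produced a prediction")
    n = len(available)
    return tuple(sum(p[k] for p in available) / n for k in range(3))

# Dixon-Coles has no estimate here, so Elo and HP each get weight 1/2.
pooled = pool_uniform({
    "elo": (0.50, 0.22, 0.28),
    "dixon_coles": None,
    "hierarchical_poisson": (0.46, 0.26, 0.28),
})
```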

Why uniform weights? Bayesian stacking weights were fitted (Yao et al. 2018, LOO log-score optimisation). The result: DC 63%, Elo 32%, HP 6%. But the stacking ensemble did not beat uniform averaging on the walk-forward gate (median Brier 0.4954 vs 0.4941 for uniform). The per-walk weights were also unstable (Elo weight ranged 0.24–0.51 across walks), confirming a flat optimum. Uniform averaging ships because it's more robust.

Extremization (d = 1.15)

After pooling, probabilities are pushed away from the uniform prior (1/3) in log-odds space:

lo_k = log(p_k / prior)    for each class k ∈ {H, D, A}
elo_k = d × lo_k    (extremized log-odds; unrelated to the Elo rating)
p_k_new ∝ prior × exp(elo_k)

This corrects for the underconfidence inherent in linear probability pools (Ranjan & Gneiting, 2010): a linear average of individually calibrated forecasters is necessarily underconfident, and extremization is the standard fix.
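
The three-line transform above, written out as a Python sketch. The function name `extremize` is hypothetical, and the code assumes every input probability is strictly positive.

```python
import math

def extremize(probs, d=1.15, prior=1.0 / 3.0):
    """Scale log-odds against the prior by d, then renormalise."""
    scaled = [prior * math.exp(d * math.log(p / prior)) for p in probs]
    z = sum(scaled)
    return [s / z for s in scaled]

# Made-up pooled forecast, for illustration only.
before = [0.50, 0.25, 0.25]
after = extremize(before)
```

A forecast already at the uniform prior is left unchanged; anything else is pushed further from 1/3, with the favourite gaining the most.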

Backtest validation: 8-fold walk-forward, tournament matches only (n=311):

d=1.00 (no extremization): Brier 0.5084
d=1.15: Brier 0.5063 (−21 basis points)
Improvement is monotonic from d=1.00 to d=1.30; d=1.15 is a conservative choice from the literature.

Per-tier calibration

After extremization, the ensemble applies a calibration adjustment that depends on the match type:

Tier	Method	n (training)	Platt temperature
Friendly	Isotonic (PAV)	411	T = 0.863
Qualifier	Isotonic (PAV)	457	T = 0.917
Tournament	Platt scaling	70	T = 1.157

The tournament tier uses Platt temperature scaling (a single parameter) rather than isotonic calibration because n=70 tournament matches is too few for stable isotonic curves. A preference override triggers automatically when the tournament tier has < 200 training matches.
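
For readers unfamiliar with temperature scaling, here is a hedged sketch. It assumes the common convention of dividing log-probabilities by T before renormalising (so T > 1 softens a forecast and T < 1 sharpens it); the post does not state which convention the production code uses. Both function names are hypothetical.

```python
import math

def temperature_scale(probs, temperature):
    """Divide log-probabilities by T and renormalise (softmax form)."""
    scaled = [math.log(p) / temperature for p in probs]
    top = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scaled]
    z = sum(weights)
    return [w / z for w in weights]

def pick_calibrator(n_training, min_isotonic_n=200):
    """Fall back to the one-parameter method when the tier is small."""
    return "isotonic" if n_training >= min_isotonic_n else "platt"

# Made-up forecast, scaled with the tournament-tier temperature.
raw = [0.55, 0.25, 0.20]
softened = temperature_scale(raw, 1.157)
```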

Calibration ECE (5-fold cross-validated):

Uncalibrated: 4.12pp
After per-tier calibration: 3.49pp (isotonic), 3.73pp (Platt)
Tournament tier specifically: 11.71pp uncalibrated → 9.38pp (isotonic) / 10.71pp (Platt)
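
ECE has several multi-class variants, and this post does not say which one produced the figures above. The sketch below shows one common choice, top-label ECE with equal-width bins, purely as an illustration; the function name and the 10-bin default are assumptions.

```python
def top_label_ece(probs, outcomes, n_bins=10):
    """probs: list of [home, draw, away]; outcomes: list of class indices."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        confidence = max(p)
        predicted = p.index(confidence)
        idx = min(int(confidence * n_bins), n_bins - 1)
        bins[idx].append((confidence, 1.0 if predicted == y else 0.0))
    n = len(probs)
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            accuracy = sum(hit for _, hit in b) / len(b)
            ece += len(b) / n * abs(avg_conf - accuracy)
    return ece

# Two made-up forecasts at 80% confidence, both correct: underconfident by 20pp.
example_ece = top_label_ece([[0.8, 0.1, 0.1], [0.8, 0.1, 0.1]], [0, 0])
```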

Goalkeeper defence offset

The starting goalkeeper's quality adjusts the expected goals conceded:

λ_home_adj = λ_home × exp(−α × centred_rating[away_gk])
λ_away_adj = λ_away × exp(−α × centred_rating[home_gk])

Ratings are centred on the mean across all 48 WC teams (zero-mean offset). Here α is the offset scale, not the attack strength α_i of the Poisson models. α = 0.05 was selected as the grid-search winner across {0.0, 0.001, 0.005, 0.01, 0.02, 0.05}; the improvement was monotone across the grid.
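
A small sketch of the offset, assuming ratings are held in a dictionary keyed by goalkeeper. The function names, keeper labels and rating values are hypothetical; the rating scale itself is not specified in this post.

```python
import math

def centre(ratings):
    """Subtract the mean so the average keeper has an offset of zero."""
    mean = sum(ratings.values()) / len(ratings)
    return {name: value - mean for name, value in ratings.items()}

def apply_gk_offset(lam_home, lam_away, home_gk, away_gk, centred, alpha=0.05):
    """Each side's expected goals shrink when the opposing keeper is above average."""
    lam_home_adj = lam_home * math.exp(-alpha * centred[away_gk])
    lam_away_adj = lam_away * math.exp(-alpha * centred[home_gk])
    return lam_home_adj, lam_away_adj

# Made-up ratings for three keepers, for illustration only.
centred_rating = centre({"keeper_a": 2.0, "keeper_b": 0.0, "keeper_c": -2.0})
adj_home, adj_away = apply_gk_offset(1.4, 1.1, "keeper_c", "keeper_a", centred_rating)
```

In this example the away side fields the strongest keeper, so the home side's expected goals fall, while the home side's weak keeper raises the away side's expected goals.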

Gate result: passed. 8×90d walk-forward: median Brier 0.49398 vs 0.49409 baseline, an improvement of 1.16 basis points (lower Brier is better).

Monte Carlo bracket simulator

The ensemble probabilities feed into a full simulation of the 2026 FIFA World Cup bracket.

Key parameters:

Parameter	Value
Simulations	50,000
Goal baseline	2.5 goals/match
Elo–goal scale	0.0015 per Elo point
Third-place advance	8 of 12
Third-place combinations	495
PK resolution	Markov model (Model 15) when available; 50/50 fallback
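
The 495 figure is simply the number of ways to choose which 8 of the 12 groups send their third-placed team through. A quick check, with group letters A to L assumed for illustration:

```python
import math
from itertools import combinations

groups = "ABCDEFGHIJKL"  # 12 groups in the 48-team format
n_combos = math.comb(len(groups), 8)

# Enumerate every possible set of groups whose third-placed team advances.
advancing_sets = list(combinations(groups, 8))
```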

Group-stage draw factor: The raw ensemble draw probability is multiplied by 1.05 before group-stage simulations, then renormalised. This corrects for the empirical observation that group-stage matches at World Cups produce slightly more draws than the model's base rate predicts. The factor was selected by Brier-minimising sweep on 692 historical WC group-stage matches.
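
The draw-factor step amounts to one multiplication and a renormalisation. A minimal sketch, with a hypothetical function name and a made-up input forecast:

```python
def apply_draw_factor(p_home, p_draw, p_away, factor=1.05):
    """Inflate the draw probability, then rescale so the three sum to 1."""
    boosted_draw = p_draw * factor
    z = p_home + boosted_draw + p_away
    return p_home / z, boosted_draw / z, p_away / z

group_probs = apply_draw_factor(0.45, 0.25, 0.30)
```

Because of the renormalisation, the draw probability rises by slightly less than 5% in relative terms, and home and away shrink in proportion.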

Confidence intervals: 50 bootstrap snapshots × 5,000 sims each. The bootstrap resamples the training data, refits all component models, and re-runs the bracket. The published CI is the 5th–95th percentile across bootstrap snapshots.
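
The last step, turning per-snapshot title probabilities into an interval, might look like the sketch below. The function name is hypothetical, and the interpolation method (`method="inclusive"` in Python's `statistics.quantiles`) is an assumption; the post does not say how percentiles are interpolated.

```python
from statistics import quantiles

def percentile_interval(snapshot_values):
    """Return the (5th, 95th) percentiles across bootstrap snapshots."""
    # n=20 yields cut points at 5%, 10%, ..., 95%.
    cuts = quantiles(snapshot_values, n=20, method="inclusive")
    return cuts[0], cuts[-1]

# Synthetic values 0..100 make the percentiles easy to read off.
low, high = percentile_interval(list(range(101)))
```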

Backtest performance

8-fold walk-forward, 90-day windows, tournament matches only (n=311 common-subset matches across 7 valid folds):

Model	Brier (mean)	Brier (weighted)	Log-loss (weighted)
Elo	0.562	0.513	0.972
Dixon-Coles	0.569	0.513	0.876
Hierarchical Poisson	0.571	0.516	0.880
Ensemble	0.557	0.506	0.864

The ensemble beats every component on both Brier and log-loss. The margin is modest — the value is in robustness across regimes, not in any single fold.

A caveat on these numbers, and the canonical figure. This walk-forward composes its ensemble from the current Elo snapshot rather than each team's rating as it stood on the match date. For a recent evaluation window, that leaks a sliver of future information backward, which is why the recency-weighted Brier (0.506) reads optimistically low; the unweighted fold mean (0.557) is already much closer to the truth. The fully leakage-free figure, with Elo rolled forward and every layer refit pre-kickoff across 987 major-tournament matches from 2014–2024, is Brier 0.572 (ECE 5.6pp). That is the canonical out-of-sample tournament number; the full per-tournament breakdown is on the calibration scoreboard.

Uniform 3-class baseline: Brier = 0.667. The leakage-free ensemble's 0.572 is about a 14% improvement over that baseline.
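
The baseline arithmetic can be reproduced with the multi-class Brier score, summed over the three outcomes. The function name is hypothetical; the 0.572 figure is the one quoted above.

```python
def brier(probs, outcome):
    """Sum of squared errors against the one-hot result (outcome is a class index)."""
    return sum((p - (1.0 if k == outcome else 0.0)) ** 2 for k, p in enumerate(probs))

# A uniform forecast scores the same whatever the result.
uniform_brier = brier((1 / 3, 1 / 3, 1 / 3), 0)

# Relative improvement of the leakage-free ensemble over that baseline.
improvement = 1.0 - 0.572 / uniform_brier
```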

What's not in the model

Several features were tested and rejected:

Set-piece-aware Dixon-Coles: Brier +0.0079 worse. Not shipped.
Style-matchup pair effects: Did not clear the walk-forward gate. Not shipped.
Confederation-pooled Hierarchical Poisson: Overall Brier worsened by 57bp despite per-tier improvements. Not shipped.
Bayesian stacking weights: Did not beat uniform averaging on median Brier. Not shipped.
HistGradientBoosting meta-learner: Median Brier 0.534 vs uniform 0.498 — worse on every walk. Not shipped.

Negative results stay on the record. They constrain the design space and prevent re-running experiments that have already been run.

Model version history

#	Component	Status
1	Elo + MC bracket simulator	Shipping
2	Dixon-Coles MLE with time decay	Shipping
3	Hierarchical Poisson MAP	Shipping
3b	Confederation-pooled HP	Tested, not shipped
4	Player composite rating (FBref per-90 + TM valuation)	Data pipeline
4b	Goalkeeper-specific PSxG rating	Data pipeline
5	Tournament-scorer probability	Shipping (anytime scorer)
14	Set-piece-aware DC	Tested, not shipped
15	PK proficiency Markov model	Shipping
16	Player-composite-differential offset (α=0.05)	Shipping
17	Style-matchup pair effects	Tested, not shipped
18	Starting-GK defence offset (α=0.05)	Shipping

Numbers in this post are pinned to the May 28 model state. The methodology docs at /docs/methodology/ carry the canonical version.

All numbers are model outputs. They are for research and educational purposes only — not betting advice, not financial advice, not recommendations to gamble. Methodology: /docs/methodology/. Full Terms of Use.
