How we make predictions

In-tournament calibration

How the model's per-fixture probabilities line up against observed 2026 World Cup match outcomes. Calibration metrics only — Brier score, log-loss, ECE, reliability diagram, and per-segment Brier — measured against played matches as the tournament progresses.

Track record

Proven on past tournaments

The 2026 live tracker is empty until the first match is played. To show the model has been measured, not just described, here it is scored on tournaments whose results you already know, graded fully out of sample. Each tournament is scored by the production model reconstructed as it stood the day before the tournament's first match: Dixon-Coles and Hierarchical Poisson are refit on matches strictly beforehand, Elo is rolled forward to each match, and the tournament-tier calibration layer is refit on the 24 months of matches before the cutoff. No data from the tournament, or any later match, touches any layer of the fit.
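The date logic behind that reconstruction can be sketched as follows. This is a minimal illustration, not the production pipeline: the function names are hypothetical, and the exact window boundaries (an inclusive 24-calendar-month window ending the day before the first match) are an assumption, since the page only says "the 24 months of matches before the cutoff".

```python
import calendar
from datetime import date, timedelta


def fit_windows(first_match, calibration_months=24):
    """Return (cutoff, calibration_start) for one tournament.

    cutoff is the day before the first match. calibration_start is
    `calibration_months` calendar months earlier (assumed boundary).
    """
    cutoff = first_match - timedelta(days=1)
    # Step back whole calendar months using a running month index.
    month_index = cutoff.year * 12 + (cutoff.month - 1) - calibration_months
    year, month0 = divmod(month_index, 12)
    month = month0 + 1
    # Clamp the day for shorter months (e.g. 29 Feb -> 28 Feb).
    day = min(cutoff.day, calendar.monthrange(year, month)[1])
    return cutoff, date(year, month, day)


def usable_for_fit(match_date, first_match):
    """Only matches strictly before the tournament may enter the fit."""
    return match_date < first_match


def usable_for_calibration(match_date, first_match, calibration_months=24):
    """Matches inside the assumed calibration window (both ends inclusive)."""
    cutoff, start = fit_windows(first_match, calibration_months)
    return start <= match_date <= cutoff


# Illustration with the first-match date this page gives for 2026.
print(fit_windows(date(2026, 6, 11)))
```

For a first match on 2026-06-11 the cutoff is 2026-06-10 and the assumed calibration window starts on 2024-06-10; a match played on the opening day itself is rejected by `usable_for_fit`.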

Out-of-sample across 987 matches at 24 major tournaments (2014–2024)

Brier score
0.572

Lower is better. Uniform 1/3 each is ≈ 0.667.

Log loss
1.000

Lower is better. Uniform 1/3 each is ln(3) ≈ 1.099.
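
The two uniform baselines can be checked by hand. The derivation below assumes the Brier score sums the squared errors over the three outcomes (consistent with the [0, 2] range given further down this page) and that exactly one outcome occurs per match.

```latex
\begin{align*}
\text{Brier}_{\text{uniform}}
  &= \left(1 - \tfrac{1}{3}\right)^2 + 2\left(0 - \tfrac{1}{3}\right)^2
   = \tfrac{4}{9} + \tfrac{2}{9}
   = \tfrac{2}{3} \approx 0.667 \\
\text{LogLoss}_{\text{uniform}}
  &= -\ln\tfrac{1}{3}
   = \ln 3 \approx 1.099
\end{align*}
```

Both values are the same for every match regardless of the result, which is why they work as fixed reference lines.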

ECE
5.6pp

Mean |predicted − observed| across deciles.

Tournament by tournament

Each row: the model rebuilt as it stood the day before that tournament began, then scored on every fixture through the final (Brier; lower is better, uniform 1/3 is 0.667). Five editions land above that line, four of them low-data early editions from 2015 and 2016; calibration is measured across all 24 tournaments together, in the headline figures above.

Tournament         Host                  Matches  Brier
Copa América 2024  United States         32       0.522
Euro 2024          Germany               51       0.613
AFCON 2024         Ivory Coast           52       0.651
Asian Cup 2024     Qatar                 51       0.515
Gold Cup 2023      United States         31       0.566
World Cup 2022     Qatar                 64       0.611
AFCON 2022         Cameroon              52       0.686
Gold Cup 2021      United States         31       0.341
Copa América 2021  Brazil                28       0.481
Euro 2021          England               51       0.554
AFCON 2019         Egypt                 52       0.546
Gold Cup 2019      United States         31       0.405
Copa América 2019  Brazil                26       0.542
Asian Cup 2019     United Arab Emirates  51       0.496
World Cup 2018     Russia                64       0.569
Gold Cup 2017      United States         25       0.456
AFCON 2017         Gabon                 32       0.642
Euro 2016          France                51       0.668
Copa América 2016  United States         32       0.502
Gold Cup 2015      United States         26       0.755
Copa América 2015  Chile                 26       0.686
AFCON 2015         Equatorial Guinea     32       0.795
Asian Cup 2015     Australia             32       0.434
World Cup 2014     Brazil                64       0.565
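
As a consistency check, the per-tournament rows can be pooled back into the headline figure. The sketch below assumes the headline Brier is the match-weighted mean of the per-tournament scores (which holds when Brier is averaged over matches); the names `ROWS` and `pooled_brier` are illustrative only, and because the table values are rounded to three decimals the result is approximate.

```python
# (tournament, matches, Brier) copied from the table above.
ROWS = [
    ("Copa América 2024", 32, 0.522), ("Euro 2024", 51, 0.613),
    ("AFCON 2024", 52, 0.651), ("Asian Cup 2024", 51, 0.515),
    ("Gold Cup 2023", 31, 0.566), ("World Cup 2022", 64, 0.611),
    ("AFCON 2022", 52, 0.686), ("Gold Cup 2021", 31, 0.341),
    ("Copa América 2021", 28, 0.481), ("Euro 2021", 51, 0.554),
    ("AFCON 2019", 52, 0.546), ("Gold Cup 2019", 31, 0.405),
    ("Copa América 2019", 26, 0.542), ("Asian Cup 2019", 51, 0.496),
    ("World Cup 2018", 64, 0.569), ("Gold Cup 2017", 25, 0.456),
    ("AFCON 2017", 32, 0.642), ("Euro 2016", 51, 0.668),
    ("Copa América 2016", 32, 0.502), ("Gold Cup 2015", 26, 0.755),
    ("Copa América 2015", 26, 0.686), ("AFCON 2015", 32, 0.795),
    ("Asian Cup 2015", 32, 0.434), ("World Cup 2014", 64, 0.565),
]


def pooled_brier(rows):
    """Match-weighted mean of per-tournament Brier scores."""
    total_matches = sum(n for _, n, _ in rows)
    weighted_sum = sum(n * b for _, n, b in rows)
    return weighted_sum / total_matches


total_matches = sum(n for _, n, _ in ROWS)
above_uniform = [name for name, _, b in ROWS if b > 2 / 3]

print(len(ROWS), total_matches, round(pooled_brier(ROWS), 3))
print(above_uniform)
```

The rows sum to 987 matches across 24 tournaments and pool to roughly 0.572, matching the headline numbers; five editions score above the uniform line of 2/3.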

Reliability diagram

Predicted probability (x) versus observed frequency (y) for each outcome — home, draw, away marginals pooled. The dashed identity line is perfect calibration; the closer the points hug it, the better the model is calibrated. Marker size reflects bin count.

[Figure: reliability diagram. Predicted probability on the x-axis and observed frequency on the y-axis, both from 0.00 to 1.00, binned in deciles across [0, 1]. Per-bin counts, in figure order: n=327, 478, 911, 329, 269, 217, 184, 134, 73, 39.]
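
The binning behind a diagram like this can be sketched in a few lines. This is an illustrative reconstruction under stated assumptions, not the site's plotting code: `reliability_bins` is a hypothetical name, bins are equal-width deciles with the lower edge inclusive, and each match contributes three points (its home, draw and away probabilities), which is what "marginals pooled" is taken to mean here.

```python
def reliability_bins(probs, outcomes, n_bins=10):
    """Pool the three marginals and bin them into equal-width bins.

    probs:    list of (p_home, p_draw, p_away) tuples
    outcomes: list of observed class indices (0=home, 1=draw, 2=away)
    Returns one (bin_index, mean_predicted, observed_frequency, count)
    tuple per non-empty bin.
    """
    # Per bin: [sum of predicted probabilities, number of hits, count]
    acc = [[0.0, 0, 0] for _ in range(n_bins)]
    for p_vec, y in zip(probs, outcomes):
        for k, p in enumerate(p_vec):
            b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the top bin
            acc[b][0] += p
            acc[b][1] += 1 if k == y else 0
            acc[b][2] += 1
    return [
        (b, s / n, hits / n, n)
        for b, (s, hits, n) in enumerate(acc)
        if n > 0
    ]


# Two made-up matches, purely to show the shape of the output.
example = reliability_bins(
    [(0.5, 0.3, 0.2), (0.6, 0.25, 0.15)],
    [0, 1],
)
for row in example:
    print(row)
```

Under this pooling, the bin counts always add up to three times the number of matches; the ten counts in the figure sum to 2961, which is 3 × 987.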

Brier by favourite confidence

How the model scored across matches grouped by how confident its favourite was (the largest of the home / draw / away probabilities).

Favourite confidence  Matches  Brier
P_fav < 40%           81       0.649
P_fav 40-60%          476      0.633
P_fav 60-80%          318      0.512
P_fav >= 80%          112      0.428

  • Out-of-sample: the calibration layer is refit per tournament on pre-tournament data, so these numbers do not reuse the live shipped calibrator (which has seen these results).
  • The uniform 1/3 forecast scores a Brier of 0.667; lower is better. Major-tournament football is high-variance, so a strong model still sits well above a league-season Brier.
  • Calibrated and uncalibrated metrics are reported on the same fixtures so the calibration layer's effect is visible.
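
The favourite-confidence grouping can be sketched as below. The bucket labels come from the table; the function name and the choice to treat each lower boundary as inclusive (so exactly 40% falls in "P_fav 40-60%") are assumptions, since the page does not say how ties at a boundary are handled.

```python
def favourite_bucket(p_vec):
    """Assign a match to a confidence bucket by its largest probability."""
    p_fav = max(p_vec)  # the favourite is the most likely of home/draw/away
    if p_fav < 0.40:
        return "P_fav < 40%"
    if p_fav < 0.60:
        return "P_fav 40-60%"
    if p_fav < 0.80:
        return "P_fav 60-80%"
    return "P_fav >= 80%"


# Made-up probability vectors (home, draw, away) for illustration.
print(favourite_bucket((0.36, 0.33, 0.31)))
print(favourite_bucket((0.20, 0.25, 0.55)))
print(favourite_bucket((0.85, 0.10, 0.05)))
```

Note that the favourite can be any of the three outcomes, including the draw, so a match with an away favourite at 55% lands in the same bucket as one with a home favourite at 55%.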

Built 2026-05-30 · model 1.0.0 · calibration layer refit on the 24 months before each tournament.

2026 World Cup — live tracker

Calibration tracking will populate as matches are played. Pre-tournament backtest results are on the methodology page.

No scored matches yet

First scored match expected 2026-06-11. The tracker starts populating then — once the first fixture is played, this page will show a rolling 3-class Brier score, a per-matchday breakdown, and the headline log-loss and ECE.

Until then there is nothing to plot: the model's predictions exist (see the per-fixture pages), but there are no observed outcomes to score them against.

Reliability diagram

Needs more matches. None graded yet; the diagram populates once at least 50 are in. Until then the held-out backtest on the methodology page is the most informative calibration view.

Brier by competition

Segment  Matches  Brier  Δ vs overall
World Cup 2026

Brier by tournament stage

Segment  Matches  Brier  Δ vs overall
Group stage
Round of 32
Round of 16
Quarter-final
Semi-final
Third-place play-off
Final

Brier by favourite confidence

Segment  Matches  Brier  Δ vs overall
P_fav < 40%
P_fav 40-60%
P_fav 60-80%
P_fav >= 80%
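
Until 2026 matches are scored, the "Δ vs overall" column can be illustrated with the backtest figures reported earlier on this page. The sketch assumes Δ means segment Brier minus overall Brier, so a negative value marks a segment that scored better than the pooled figure; that sign convention and the function name are assumptions.

```python
def delta_vs_overall(segments, overall):
    """Return (segment, matches, brier, delta) rows, delta = brier - overall."""
    return [(name, n, b, round(b - overall, 3)) for name, n, b in segments]


# Backtest values from the favourite-confidence table on this page.
BACKTEST_SEGMENTS = [
    ("P_fav < 40%", 81, 0.649),
    ("P_fav 40-60%", 476, 0.633),
    ("P_fav 60-80%", 318, 0.512),
    ("P_fav >= 80%", 112, 0.428),
]
OVERALL_BACKTEST_BRIER = 0.572

rows = delta_vs_overall(BACKTEST_SEGMENTS, OVERALL_BACKTEST_BRIER)
for name, n, b, delta in rows:
    print(f"{name:<14} {n:>4} {b:.3f} {delta:+.3f}")
```

On the backtest numbers, the two lower-confidence segments sit above the overall score and the two higher-confidence segments sit below it.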

How to read these numbers

Brier score is the squared error between the model's probability vector and the observed outcome (one-hot encoded as home / draw / away), summed over the three outcomes and averaged over matches. Range [0, 2]; lower is better; a uniform 1/3-each model scores ≈ 0.667.

Log loss is the mean negative log-likelihood of the observed outcome under the model's probabilities; uniform 1/3-each scores ln(3) ≈ 1.099.

ECE (expected calibration error) bins the home-win probability into deciles and measures the average gap between predicted and observed rates within each bin.

Together they describe sharpness (Brier / log-loss) and calibration (ECE), the two things a probabilistic model can be measured on.
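
The three metrics can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the site's scoring code: the function names are hypothetical, the Brier score sums over the three outcomes (matching the [0, 2] range), probabilities are clipped at a small `eps` before taking logs, and the ECE weights each decile by its share of matches, which is the common convention but is not confirmed by this page.

```python
import math


def brier(probs, outcomes):
    """Sum of squared errors over the three classes, averaged over matches."""
    total = 0.0
    for p_vec, y in zip(probs, outcomes):
        total += sum(
            (p - (1.0 if k == y else 0.0)) ** 2 for k, p in enumerate(p_vec)
        )
    return total / len(outcomes)


def log_loss(probs, outcomes, eps=1e-15):
    """Mean negative log-probability assigned to the observed outcome."""
    return -sum(
        math.log(max(p_vec[y], eps)) for p_vec, y in zip(probs, outcomes)
    ) / len(outcomes)


def ece_home(probs, outcomes, n_bins=10):
    """Count-weighted mean |predicted - observed| for the home-win probability."""
    # Per bin: [sum of p_home, number of home wins, count]
    acc = [[0.0, 0, 0] for _ in range(n_bins)]
    for p_vec, y in zip(probs, outcomes):
        p_home = p_vec[0]
        b = min(int(p_home * n_bins), n_bins - 1)
        acc[b][0] += p_home
        acc[b][1] += 1 if y == 0 else 0
        acc[b][2] += 1
    n = len(outcomes)
    return sum(abs(s / c - wins / c) * c / n for s, wins, c in acc if c > 0)


# Sanity check: a uniform forecaster on one home win, one draw, one away win.
uniform = [(1 / 3, 1 / 3, 1 / 3)] * 3
results = [0, 1, 2]
print(brier(uniform, results), log_loss(uniform, results), ece_home(uniform, results))
```

The uniform forecaster reproduces the two baselines quoted above (2/3 and ln 3) and, on this balanced toy sample, has zero ECE: it is perfectly calibrated but not sharp, which is why the page reports all three numbers.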

For the underlying model and the held-out backtest, see the methodology page.

The metrics on this page are the model's own calibration, scored against observed match outcomes — the predicted probabilities compared to what actually happened on the pitch.