How we make predictions
In-tournament calibration
How the model's per-fixture probabilities line up against observed 2026 World Cup match outcomes. Calibration metrics only — Brier score, log-loss, ECE, reliability diagram, and per-segment Brier — measured against played matches as the tournament progresses.
Track record
Proven on past tournaments
The 2026 tracker below is empty until the first match is played. To show the model has been measured, not just described, here it is scored on tournaments whose results you already know, graded fully out of sample. Each tournament is scored by the production model reconstructed as it stood the day before the tournament's first match: Dixon-Coles and Hierarchical Poisson refit on matches strictly beforehand, Elo rolled forward to each match, and the tournament-tier calibration layer refit on the 24 months of matches before the cutoff. No data from the tournament, or any later match, touches any layer of the fit.
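The cutoff rule described above can be sketched as a plain date filter. This is a minimal illustration, not the production pipeline: the `Match` record, the `split_for_backtest` helper, and the 730-day approximation of "24 months" are all assumptions made for the example.

```python
from datetime import date, timedelta
from typing import NamedTuple


class Match(NamedTuple):
    # Hypothetical record; the real schema is not described on this page.
    played_on: date
    home: str
    away: str


def split_for_backtest(matches, first_match_day, calibration_days=730):
    """Return (fit_set, calibration_set) for one tournament.

    The cutoff is the day before the tournament's first match, so no
    tournament fixture (or anything later) can enter either set.
    calibration_days=730 is an assumed stand-in for "24 months".
    """
    cutoff = first_match_day - timedelta(days=1)
    window_start = cutoff - timedelta(days=calibration_days)
    # Everything up to and including the cutoff day feeds the base models.
    fit_set = [m for m in matches if m.played_on <= cutoff]
    # Only the most recent window feeds the calibration layer.
    calibration_set = [m for m in fit_set if m.played_on > window_start]
    return fit_set, calibration_set


# Illustrative usage with made-up fixtures around a 2022-11-20 kick-off.
history = [
    Match(date(2020, 1, 1), "E", "F"),
    Match(date(2022, 11, 19), "A", "B"),
    Match(date(2022, 11, 20), "C", "D"),  # tournament fixture: excluded
]
fit_set, calibration_set = split_for_backtest(history, date(2022, 11, 20))
```

Filtering on the date alone keeps the leak check easy to audit: any match dated on or after the first fixture is excluded from both sets.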
Out-of-sample across 987 matches at 24 major tournaments (2014–2024)
- Brier score: 0.572. Lower is better. Uniform 1/3 each is ≈ 0.667.
- Log loss: 1.000. Lower is better. Uniform 1/3 each is ln(3) ≈ 1.099.
- ECE: 5.6pp. Mean |predicted − observed| across deciles.
Tournament by tournament
Each row shows the model rebuilt as it stood the day before that tournament began, then scored on every fixture through the final (Brier; lower is better, uniform 1/3 is 0.667). Five editions score above that line (AFCON 2022, Euro 2016, Gold Cup 2015, Copa América 2015, AFCON 2015), most of them low-data early editions; the headline figures above are measured across all 24 tournaments together.
| Tournament | Host | Matches | Brier |
|---|---|---|---|
| Copa América 2024 | United States | 32 | 0.522 |
| Euro 2024 | Germany | 51 | 0.613 |
| AFCON 2024 | Ivory Coast | 52 | 0.651 |
| Asian Cup 2024 | Qatar | 51 | 0.515 |
| Gold Cup 2023 | United States | 31 | 0.566 |
| World Cup 2022 | Qatar | 64 | 0.611 |
| AFCON 2022 | Cameroon | 52 | 0.686 |
| Gold Cup 2021 | United States | 31 | 0.341 |
| Copa América 2021 | Brazil | 28 | 0.481 |
| Euro 2021 | England | 51 | 0.554 |
| AFCON 2019 | Egypt | 52 | 0.546 |
| Gold Cup 2019 | United States | 31 | 0.405 |
| Copa América 2019 | Brazil | 26 | 0.542 |
| Asian Cup 2019 | United Arab Emirates | 51 | 0.496 |
| World Cup 2018 | Russia | 64 | 0.569 |
| Gold Cup 2017 | United States | 25 | 0.456 |
| AFCON 2017 | Gabon | 32 | 0.642 |
| Euro 2016 | France | 51 | 0.668 |
| Copa América 2016 | United States | 32 | 0.502 |
| Gold Cup 2015 | United States | 26 | 0.755 |
| Copa América 2015 | Chile | 26 | 0.686 |
| AFCON 2015 | Equatorial Guinea | 32 | 0.795 |
| Asian Cup 2015 | Australia | 32 | 0.434 |
| World Cup 2014 | Brazil | 64 | 0.565 |
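Because the pooled Brier is a mean over matches, the match-weighted average of the per-tournament rows should reproduce the headline figure. The short check below does that arithmetic with the values copied from the table. Each row is rounded to three decimals, so agreement is only expected to within rounding.

```python
# (matches, Brier) pairs copied from the table above, in table order.
rows = [
    (32, 0.522), (51, 0.613), (52, 0.651), (51, 0.515),
    (31, 0.566), (64, 0.611), (52, 0.686), (31, 0.341),
    (28, 0.481), (51, 0.554), (52, 0.546), (31, 0.405),
    (26, 0.542), (51, 0.496), (64, 0.569), (25, 0.456),
    (32, 0.642), (51, 0.668), (32, 0.502), (26, 0.755),
    (26, 0.686), (32, 0.795), (32, 0.434), (64, 0.565),
]

total_matches = sum(n for n, _ in rows)
# Weight each tournament's mean Brier by its number of matches.
pooled_brier = sum(n * b for n, b in rows) / total_matches

print(total_matches, round(pooled_brier, 3))  # prints: 987 0.572
```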
Reliability diagram
Predicted probability (x) versus observed frequency (y) for each outcome — home, draw, away marginals pooled. The dashed identity line is perfect calibration; the closer the points hug it, the better the model is calibrated. Marker size reflects bin count.
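The points on a diagram like this can be computed with a small binning routine. The sketch below pools the home, draw, and away probabilities as described, but the function name, the input layout, and the use of ten equal-width bins are assumptions for illustration; the page does not state how the production bins are drawn.

```python
def reliability_points(forecasts, outcomes, n_bins=10):
    """Pooled reliability points for 3-way forecasts.

    forecasts: list of (p_home, p_draw, p_away) tuples.
    outcomes:  list of observed outcome indices (0 home, 1 draw, 2 away).
    Returns one (mean_predicted, observed_frequency, count) tuple per
    non-empty bin. Equal-width bins are an assumption of this sketch.
    """
    sum_pred = [0.0] * n_bins
    hits = [0] * n_bins
    counts = [0] * n_bins
    for probs, outcome in zip(forecasts, outcomes):
        # Each match contributes three (probability, happened?) pairs.
        for k, p in enumerate(probs):
            b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the top bin
            sum_pred[b] += p
            hits[b] += int(k == outcome)
            counts[b] += 1
    return [
        (sum_pred[b] / counts[b], hits[b] / counts[b], counts[b])
        for b in range(n_bins)
        if counts[b] > 0
    ]
```

The third element of each tuple is the bin count, which is what a marker-size encoding would use.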
Brier by favourite confidence
How the model scored across matches grouped by how confident its favourite was (the largest of the home / draw / away probabilities).
| Favourite confidence | Matches | Brier |
|---|---|---|
| P_fav < 40% | 81 | 0.649 |
| P_fav 40-60% | 476 | 0.633 |
| P_fav 60-80% | 318 | 0.512 |
| P_fav >= 80% | 112 | 0.428 |
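The grouping in this table can be sketched as follows. The bucket labels match the table, but the helper names and the handling of exact boundaries (whether 40% or 60% falls in the lower or upper bucket) are assumptions of the example.

```python
from collections import defaultdict


def favourite_bucket(probs):
    """Label a forecast by the largest of its home / draw / away probabilities."""
    p_fav = max(probs)
    if p_fav < 0.4:
        return "P_fav < 40%"
    if p_fav < 0.6:
        return "P_fav 40-60%"
    if p_fav < 0.8:
        return "P_fav 60-80%"
    return "P_fav >= 80%"


def brier(probs, outcome):
    """3-class Brier for one match: squared error summed over the outcomes."""
    return sum((p - (1.0 if k == outcome else 0.0)) ** 2
               for k, p in enumerate(probs))


def brier_by_bucket(forecasts, outcomes):
    """Return {bucket label: (matches, mean Brier)} for non-empty buckets."""
    scores = defaultdict(list)
    for probs, outcome in zip(forecasts, outcomes):
        scores[favourite_bucket(probs)].append(brier(probs, outcome))
    return {label: (len(v), sum(v) / len(v)) for label, v in scores.items()}
```

For a three-way forecast the favourite's probability is never below 1/3, so the lowest bucket only covers near-coin-flip fixtures.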
- Out-of-sample: the calibration layer is refit per tournament on pre-tournament data, so these numbers do not reuse the live shipped calibrator (which has seen these results).
- The uniform 1/3 forecast scores a Brier of 0.667; lower is better. Major-tournament football is high-variance, so a strong model still sits well above a league-season Brier.
- Calibrated and uncalibrated metrics are reported on the same fixtures so the calibration layer's effect is visible.
Built 2026-05-30 · model 1.0.0 · calibration layer refit on the 24 months before each tournament.
2026 World Cup — live tracker
Calibration tracking will populate as matches are played. Pre-tournament backtest results are on the methodology page.
No scored matches yet
First scored match expected 2026-06-11. Once that fixture is played, this page will show a rolling 3-class Brier score, a per-matchday breakdown, and the headline log loss and ECE.
Until then there is nothing to plot: the model's predictions exist (see the per-fixture pages), but there are no observed outcomes to score them against.
Reliability diagram
Needs more matches. None graded yet; the diagram populates once at least 50 are in. Until then the held-out backtest on the methodology page is the most informative calibration view.
Brier by competition
| Segment | Matches | Brier | Δ vs overall |
|---|---|---|---|
| World Cup 2026 | — | — | — |
Brier by tournament stage
| Segment | Matches | Brier | Δ vs overall |
|---|---|---|---|
| Group stage | — | — | — |
| Round of 32 | — | — | — |
| Round of 16 | — | — | — |
| Quarter-final | — | — | — |
| Semi-final | — | — | — |
| Third-place play-off | — | — | — |
| Final | — | — | — |
Brier by favourite confidence
| Segment | Matches | Brier | Δ vs overall |
|---|---|---|---|
| P_fav < 40% | — | — | — |
| P_fav 40-60% | — | — | — |
| P_fav 60-80% | — | — | — |
| P_fav >= 80% | — | — | — |
How to read these numbers
Brier score is the squared error between the model's probability vector and the observed outcome (one-hot encoded as home / draw / away), summed over the three outcomes and averaged over matches. Range [0, 2]; lower is better; a uniform 1/3-each model scores ≈ 0.667.
Log loss is the mean negative log-likelihood of the observed outcome under the model's probabilities; lower is better; a uniform 1/3-each model scores ln(3) ≈ 1.099.
ECE (expected calibration error) bins the home-win probability into deciles and measures the average gap between predicted and observed rates within each bin.
Together they describe sharpness (Brier / log loss) and calibration (ECE): the two things a probabilistic model can be measured on.
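The three definitions can be written out in a few lines. This is a sketch under stated assumptions rather than the scoring code behind this page: the function names are illustrative, the ECE uses ten equal-width bins weighted by bin count, and probabilities are clipped before taking logs to avoid log(0).

```python
import math


def brier_score(forecasts, outcomes):
    """Mean over matches of the squared error summed over home/draw/away."""
    total = 0.0
    for probs, outcome in zip(forecasts, outcomes):
        total += sum((p - (1.0 if k == outcome else 0.0)) ** 2
                     for k, p in enumerate(probs))
    return total / len(outcomes)


def log_loss(forecasts, outcomes, eps=1e-15):
    """Mean negative log of the probability given to the observed outcome."""
    # eps clipping is an assumption of this sketch, not a documented choice.
    return -sum(math.log(max(probs[o], eps))
                for probs, o in zip(forecasts, outcomes)) / len(outcomes)


def ece_home_win(forecasts, outcomes, n_bins=10):
    """Count-weighted mean |predicted - observed| over home-win bins."""
    sum_pred = [0.0] * n_bins
    wins = [0] * n_bins
    counts = [0] * n_bins
    for probs, outcome in zip(forecasts, outcomes):
        b = min(int(probs[0] * n_bins), n_bins - 1)
        sum_pred[b] += probs[0]
        wins[b] += int(outcome == 0)  # index 0 is a home win
        counts[b] += 1
    n = len(outcomes)
    return sum(
        counts[b] / n * abs(sum_pred[b] / counts[b] - wins[b] / counts[b])
        for b in range(n_bins)
        if counts[b] > 0
    )


# The uniform forecast reproduces the two reference values quoted above.
uniform = [(1 / 3, 1 / 3, 1 / 3)]
print(round(brier_score(uniform, [0]), 3))  # prints: 0.667
print(round(log_loss(uniform, [0]), 3))     # prints: 1.099
```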
For the underlying model and the held-out backtest, see the methodology page.