Track record

How accurate is the model?

Anyone can say a team has a 70% chance. The honest test is what happens next: do teams given a 70% chance really win about seven times in ten? Here is the one number that answers that, and every place you can check our working.

Graded against results you already know

When this model says a team has a 70% chance, that is about how often it happens. We tested it on all 987 matches from 24 past tournaments (2014–2024), grading each one with the model rebuilt as it stood the day before kickoff, so it never saw the result. Its stated chances landed within about 5.6 percentage points of how often those outcomes actually happened.

Put as a single number: on average it rated the result that actually happened about 35% more likely than a blind 1-in-3 guess would.

For the statistically minded, that is a Brier score of 0.572 against the 0.667 a blind 1-in-3 guess scores (lower is better). It is the honest yardstick for 2026, not a number flattered after the fact.
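
For readers who want to reproduce the baseline, here is a minimal sketch of how a score like this can be computed. It assumes the figure is a three-outcome Brier score (win, draw, loss) in its sum-of-squares form, which is the form under which a blind 1-in-3 guess scores exactly 0.667. The function names and the sample forecast are illustrative, not taken from the model's code.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], actual_index: int) -> float:
    """Three-outcome Brier score for one match (assumed sum-of-squares form).

    probs        -- stated probabilities, e.g. (home win, draw, away win)
    actual_index -- position in probs of the outcome that happened
    """
    return sum(
        (p - (1.0 if i == actual_index else 0.0)) ** 2
        for i, p in enumerate(probs)
    )


def mean_brier(forecasts, results) -> float:
    """Average the per-match score over a set of graded matches."""
    scores = [brier_score(p, r) for p, r in zip(forecasts, results)]
    return sum(scores) / len(scores)


# A blind guess gives every outcome 1/3, whatever happens next.
blind = (1 / 3, 1 / 3, 1 / 3)
print(round(brier_score(blind, 0), 3))  # 0.667

# Hypothetical forecast: 70% on the outcome that then happened.
print(round(brier_score((0.7, 0.2, 0.1), 0), 3))  # 0.14
```

A perfect forecast scores 0 and a confidently wrong one approaches 2, so anything below 0.667 beats guessing.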

See the full scoreboard: by tournament, by confidence band, with reliability diagrams →

The 2026 tournament is being scored live as it plays. The tracker above updates per match.

Match by match

Tournament report card

Every match graded as it's played. The pre-match probability locks before kickoff; after the final whistle, it scores against the result. This is the live, match-level evidence behind the numbers above.

96/104 graded · mean 0.497 · 54 strong calls

See every match, graded →

Check the working

Five ways the model is held to account: the evidence, the failures, and the versioned record behind every number.

Live + held-out

Calibration scoreboard

The full held-out backtest broken out by tournament and confidence band, plus the live tracker that scores every 2026 match as it's played. A 70%-rated outcome should happen about 70% of the time. This is where you check.
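
The confidence-band check can be sketched as follows. This is an illustration under stated assumptions, not the scoreboard's actual code: the ten equal-width bands, the function names, and the sample data are all hypothetical, and the page does not say exactly how its own calibration gap is computed.

```python
from collections import defaultdict


def reliability_table(forecasts, n_bands=10):
    """Group (stated probability, happened) pairs into confidence bands.

    Returns one row per non-empty band with the mean stated probability,
    the observed frequency, and the number of forecasts in the band.
    """
    bands = defaultdict(list)
    for prob, happened in forecasts:
        # A probability of exactly 1.0 goes into the top band.
        idx = min(int(prob * n_bands), n_bands - 1)
        bands[idx].append((prob, happened))

    rows = []
    for idx in sorted(bands):
        items = bands[idx]
        stated = sum(p for p, _ in items) / len(items)
        observed = sum(1 for _, h in items if h) / len(items)
        rows.append({
            "band_start": idx / n_bands,
            "stated": stated,
            "observed": observed,
            "n": len(items),
        })
    return rows


def weighted_gap(rows):
    """Average |stated - observed| across bands, weighted by band size."""
    total = sum(r["n"] for r in rows)
    return sum(r["n"] * abs(r["stated"] - r["observed"]) for r in rows) / total


# Hypothetical sample: ten 70% calls (7 came true), four 20% calls (1 came true).
sample = [(0.7, True)] * 7 + [(0.7, False)] * 3 + [(0.2, True)] + [(0.2, False)] * 3
for row in reliability_table(sample):
    print(row)
print(f"weighted gap: {weighted_gap(reliability_table(sample)):.3f}")
```

A well-calibrated model has `stated` and `observed` close together in every band; plotting one against the other gives a reliability diagram.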

The argument · free

Why trust the numbers

The discipline behind the probabilities: pre-registered acceptance gates, tier-honest reporting, and the parts of the model where confidence is genuinely lower, called out by name.

Published failures

What didn't work

Every model variant that failed the ship gate, published in full with its verdict. The no-ships are as visible as the wins. If only the winners showed, the model would look more inevitable than it is.

Versioned record

Brier at every release

The versioned history of the model, each retrain stamped with its Brier-at-release, so the number on any page traces back to a dated row.

How it's built

Methodology

The component models, training procedure, data sources, and backtest design. All reproducible from public data.

Prediction integrity

Locked before kickoff

Every match forecast locks a few hours before kickoff. The locked probabilities are the final prediction the model is graded against. Once frozen, the numbers cannot change, so the calibration scores on this page reflect what was actually published before each match.
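
One way a lock like this could be made checkable is sketched below. It is an assumption for illustration only: the page does not describe its locking mechanism, and the record layout, the SHA-256 digest, the function names, and the sample match are all hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class LockedForecast:
    match_id: str
    probs: tuple      # (home win, draw, away win)
    locked_at: str    # ISO 8601 timestamp in UTC
    digest: str       # SHA-256 over the fields above


def _digest(match_id, probs, locked_at):
    """Hash a canonical JSON form of the forecast."""
    payload = json.dumps(
        {"match_id": match_id, "probs": list(probs), "locked_at": locked_at},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def lock_forecast(match_id, probs, kickoff, now):
    """Freeze a forecast; refuse if kickoff has already been reached."""
    if now >= kickoff:
        raise ValueError("cannot lock a forecast at or after kickoff")
    locked_at = now.astimezone(timezone.utc).isoformat()
    probs = tuple(probs)
    return LockedForecast(match_id, probs, locked_at, _digest(match_id, probs, locked_at))


def verify(forecast):
    """Recompute the digest to confirm the stored numbers are unchanged."""
    return _digest(forecast.match_id, forecast.probs, forecast.locked_at) == forecast.digest


# Hypothetical match, locked three hours before a made-up kickoff time.
kickoff = datetime(2026, 1, 1, 19, 0, tzinfo=timezone.utc)
locked = lock_forecast("example-match", (0.52, 0.27, 0.21), kickoff,
                       now=kickoff - timedelta(hours=3))
print(verify(locked))  # True
```

Publishing the digest alongside the forecast would let anyone confirm later that the graded probabilities are the ones that were locked.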