22 June 2026 · edwin-chan

We grade every prediction we publish. Here's the scorecard.

36 matches graded. 18 strong calls. 6 outright misses. The model's mean Brier is 0.598; the market consensus is at 0.569. We built a page that holds us accountable in real time: every match gets a letter grade, the running mean updates daily, and the misses sit right next to the best calls. No hiding.

Most probability models publish their forecasts before kickoff. Very few publish what happened after.

We just launched two pages that hold our model accountable on every match of this World Cup:

  • The accuracy hub collects the held-out backtest, the live calibration tracker, published negative results, and the methodology changelog in one place.
  • The report card grades every individual match prediction from A+ to F, with the running tournament mean updated daily.

The numbers so far

36 group-stage matches scored. The headline:

| Metric | Value |
| --- | --- |
| Mean Brier score | 0.598 |
| Backtest baseline (830 matches, 23 tournaments) | 0.642 |
| Strong calls (Brier < 0.45) | 18 of 36 |
| Misses (Brier > 1.0) | 6 of 36 |
| Consensus market benchmark | 0.569 |

The model is beating its own historical backtest baseline: 0.598 against 0.642, where lower is better. It is slightly behind the consensus market aggregate, which sits at 0.569. Both facts are published on the page, not one without the other.
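For readers who want to reproduce the headline figures from a list of per-match scores, here is a minimal sketch. The function name `summarize` and the four sample scores are hypothetical, not taken from the report card; the thresholds (0.45 and 1.0) and the published means (0.598, 0.642, 0.569) come from the table above.

```python
from statistics import mean

STRONG_BELOW = 0.45  # "strong call" threshold from the table above
MISS_ABOVE = 1.0     # "miss" threshold from the table above


def summarize(briers):
    """Summarize a list of per-match Brier scores (hypothetical helper)."""
    return {
        "matches": len(briers),
        "mean_brier": mean(briers),
        "strong_calls": sum(b < STRONG_BELOW for b in briers),
        "misses": sum(b > MISS_ABOVE for b in briers),
    }


# Made-up per-match scores, for illustration only.
sample = summarize([0.15, 0.40, 0.70, 1.45])
print(sample)

# Gaps between the published means. Lower Brier is better, so a positive
# gap means the model scored better than the reference.
MODEL, BASELINE, MARKET = 0.598, 0.642, 0.569
gap_vs_baseline = BASELINE - MODEL  # about +0.044: model ahead of its backtest
gap_vs_market = MARKET - MODEL      # about -0.029: model behind the market
print(round(gap_vs_baseline, 3), round(gap_vs_market, 3))
```

The sign convention (reference minus model) is a choice made here for readability; the report card itself may present the gaps differently.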

The misses are right there

Six matches where the model was confidently wrong:

The biggest: the model gave Ecuador an 82% win probability against Curacao. Dick Advocaat parked the bus and the match finished 0-0. The model's formation-blind Elo system has no representation for a 78-year-old Dutch coach running a defensive masterclass against a top-20 side.

Cape Verde 0-0 Spain. Saudi Arabia 1-1 Belgium. These sit at the top of the report card with big red F grades. If we only showed the 18 strong calls, you would have no way to calibrate whether the model is worth following. The misses are the calibration.

Why this matters

Any probability model can publish numbers. The question a reader should ask: "How do I know these are well-calibrated?"

The answer should not be "trust us." The answer should be a live scorecard, updated every day, with the specific methodology for how each match is graded, and with the embarrassing results displayed as prominently as the good ones.

That is what /accuracy/ and /report-card/ do.

What the grades mean

Each match gets a Brier score (the squared distance between the forecast probabilities and the actual outcome; lower is better) and a letter grade:

| Grade | Brier range | Meaning |
| --- | --- | --- |
| A+ | < 0.20 | Near-perfect confidence in the right outcome |
| A | 0.20 to 0.45 | Strong call, clearly above baseline |
| B | 0.45 to 0.65 | Respectable, around the historical mean |
| C | 0.65 to 0.85 | Below average, model hedged too much or backed the wrong horse |
| D | 0.85 to 1.05 | Poor, the model was meaningfully wrong |
| F | > 1.05 | Miss. The model was confidently wrong. |
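The grading can be sketched in a few lines. This is an illustration under stated assumptions, not the site's actual code: it assumes the Brier score is summed over the three outcomes (win, draw, loss), which gives a 0 to 2 range and is consistent with thresholds above 1.0; it assumes a score exactly on a shared boundary takes the worse grade, except that 1.05 stays a D because F is defined as strictly above 1.05. The names `brier_score` and `letter_grade`, and the draw and loss probabilities in the example, are hypothetical.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcome_index: int) -> float:
    """Multi-outcome Brier score: sum of squared errors across all outcomes.

    probs         -- forecast probabilities, e.g. (win, draw, loss)
    outcome_index -- index of the outcome that actually happened
    """
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(
        (p - (1.0 if i == outcome_index else 0.0)) ** 2
        for i, p in enumerate(probs)
    )


def letter_grade(brier: float) -> str:
    """Map a Brier score to a letter grade using the ranges in the table.

    Boundary handling is an assumption (see the lead-in paragraph).
    """
    if brier < 0.20:
        return "A+"
    if brier < 0.45:
        return "A"
    if brier < 0.65:
        return "B"
    if brier < 0.85:
        return "C"
    if brier <= 1.05:
        return "D"
    return "F"


# An 82% favourite that draws. The 0.12 draw and 0.06 loss probabilities
# are made up for illustration; only the 82% figure appears in the post.
score = brier_score((0.82, 0.12, 0.06), outcome_index=1)
print(round(score, 4), letter_grade(score))
```

Under the same assumptions, a forecast that gives each of the three outcomes a probability of 1/3 always scores about 0.667, which lands in the C band regardless of the result.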

The running mean, the grade distribution, and the day-by-day breakdown are all on the report card. The page updates automatically as matches are scored.

Go look



503 words · Published 22 June 2026
