We grade every prediction we publish. Here's the scorecard.

Most probability models publish their forecasts before kickoff. Very few publish what happened after.

We just launched two pages that hold our model accountable on every match of this World Cup:

The accuracy hub collects the held-out backtest, the live calibration tracker, published negative results, and the methodology changelog in one place.
The report card grades every individual match prediction from A+ to F, with the running tournament mean updated daily.

The numbers so far

36 group-stage matches scored. The headline:

Metric	Value
Mean Brier score	0.598
Backtest baseline (830 matches, 23 tournaments)	0.642
Strong calls (Brier < 0.45)	18 of 36
Misses (Brier > 1.0)	6 of 36
Consensus market benchmark	0.569

The model is beating its own historical baseline by 0.044 (0.598 against 0.642; lower is better). It trails the consensus market benchmark by 0.029 (0.598 against 0.569). Both facts are published on the page, side by side.
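
For readers who want to check the arithmetic, here is a minimal Python sketch that works out the gaps from the table above. The variable and function names are illustrative, not taken from our codebase; the only inputs are the figures in the table.

```python
# Figures copied from the headline table; lower Brier is better.
model_mean = 0.598
backtest_baseline = 0.642
market_benchmark = 0.569
matches_scored = 36
strong_calls = 18
misses = 6


def relative_gap(score: float, reference: float) -> float:
    """Fraction by which `score` sits below `reference` (negative if above)."""
    return (reference - score) / reference


vs_baseline = relative_gap(model_mean, backtest_baseline)
vs_market = relative_gap(model_mean, market_benchmark)

print(f"vs backtest baseline: {backtest_baseline - model_mean:.3f} lower ({vs_baseline:.1%})")
print(f"vs market benchmark:  {model_mean - market_benchmark:.3f} higher ({-vs_market:.1%})")
print(f"strong calls: {strong_calls / matches_scored:.1%}, misses: {misses / matches_scored:.1%}")
```

In relative terms that is roughly 7% better than the backtest baseline and roughly 5% worse than the market benchmark, with half of the 36 matches landing as strong calls and one in six as misses.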

The misses are right there

Six matches where the model was confidently wrong:

The biggest: the model gave Ecuador an 82% win probability against Curacao. Dick Advocaat parked the bus and the match finished 0-0. The model's formation-blind Elo system has no representation for a 78-year-old Dutch coach running a defensive masterclass against a top-20 side.

Cape Verde 0-0 Spain. Saudi Arabia 1-1 Belgium. These sit at the top of the report card with big red F grades. If we only showed the 18 strong calls, you would have no way to calibrate whether the model is worth following. The misses are the calibration.

Why this matters

Any probability model can publish numbers. The question a reader should ask: "How do I know these are well-calibrated?"

The answer should not be "trust us." The answer should be a live scorecard, updated every day, with the specific methodology for how each match is graded, and with the embarrassing results displayed as prominently as the good ones.

That is what /accuracy/ and /report-card/ do.

What the grades mean

Each match gets a Brier score (the squared gap between the forecast probabilities and the actual outcome, summed over the possible results, so lower is better) and a letter grade:

Grade	Brier range	Meaning
A+	< 0.20	Near-perfect confidence in the right outcome
A	0.20 to 0.45	Strong call, clearly above baseline
B	0.45 to 0.65	Respectable, around the historical mean
C	0.65 to 0.85	Below average, model hedged too much or backed the wrong horse
D	0.85 to 1.05	Poor, the model was meaningfully wrong
F	> 1.05	Miss. The model was confidently wrong.
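
To make the grading concrete, here is a minimal Python sketch of how a single match could be scored and graded. It assumes the three-outcome form of the Brier score (win, draw, loss, summed, range 0 to 2), which is consistent with scores above 1 appearing in the table. The function names are hypothetical, and the handling of scores that land exactly on a band boundary is an assumption, since the table only pins down the two ends (below 0.20 is A+, above 1.05 is F).

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcome_index: int) -> float:
    """Multi-category Brier score: squared error summed over every outcome.

    `probs` are the forecast probabilities (e.g. home win, draw, away win);
    `outcome_index` is the position of the result that actually happened.
    """
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(
        (p - (1.0 if i == outcome_index else 0.0)) ** 2
        for i, p in enumerate(probs)
    )


def letter_grade(brier: float) -> str:
    """Map a Brier score to a letter using the ranges in the grade table.

    Assumption: a score exactly on a shared boundary goes to the worse band,
    except 1.05 itself, which stays a D because F is listed as "> 1.05".
    """
    if brier < 0.20:
        return "A+"
    if brier < 0.45:
        return "A"
    if brier < 0.65:
        return "B"
    if brier < 0.85:
        return "C"
    if brier <= 1.05:
        return "D"
    return "F"


# Hypothetical split: only the 82% favourite figure comes from the post;
# the 12% draw / 6% underdog split is invented for illustration.
example_probs = [0.82, 0.12, 0.06]
example_score = brier_score(example_probs, outcome_index=1)  # match was drawn
example_grade = letter_grade(example_score)
```

With that illustrative split, an 82% favourite that draws scores well above 1.05 and grades as an F. For comparison, a forecast of one third on each outcome scores about 0.667 whatever the result.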

The running mean, the grade distribution, and the day-by-day breakdown are all on the report card. The page updates automatically as matches are scored.

Go look

/accuracy/ for the full accountability hub
/report-card/ for the match-by-match grades
/docs/calibration/ for the technical calibration methodology
