Most probability models publish their forecasts before kickoff. Very few publish what happened after.
We just launched two pages that hold our model accountable on every match of this World Cup:
- The accuracy hub collects the held-out backtest, the live calibration tracker, published negative results, and the methodology changelog in one place.
- The report card grades every individual match prediction from A+ to F, with the running tournament mean updated daily.
The numbers so far
36 group-stage matches scored. The headline:
| Metric | Value |
|---|---|
| Mean Brier score | 0.598 |
| Backtest baseline (830 matches, 23 tournaments) | 0.642 |
| Strong calls (Brier < 0.45) | 18 of 36 |
| Misses (Brier > 1.0) | 6 of 36 |
| Consensus market benchmark | 0.569 |
The model's mean Brier score of 0.598 is 0.044 below its backtest baseline of 0.642 (lower is better), so it is beating its own historical record. It is 0.029 behind the consensus market aggregate at 0.569. Both facts are published on the page, side by side.
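For readers who want to reproduce the headline figures from a list of per-match scores, here is a minimal sketch. The function name `summarize` and the sample scores are illustrative assumptions, not the site's actual pipeline; the thresholds and the two benchmark values are taken from the table above.

```python
from statistics import mean

# Thresholds as stated in the headline table.
STRONG_BELOW = 0.45  # "strong call": Brier < 0.45
MISS_ABOVE = 1.0     # "miss": Brier > 1.0


def summarize(brier_scores, backtest=0.642, market=0.569):
    """Summarize per-match Brier scores (hypothetical helper).

    Negative gaps mean the model scored better (lower) than the benchmark.
    """
    avg = mean(brier_scores)
    return {
        "matches": len(brier_scores),
        "mean_brier": round(avg, 3),
        "strong_calls": sum(s < STRONG_BELOW for s in brier_scores),
        "misses": sum(s > MISS_ABOVE for s in brier_scores),
        "gap_vs_backtest": round(avg - backtest, 3),
        "gap_vs_market": round(avg - market, 3),
    }


# Made-up scores for illustration only, not real match data.
example = summarize([0.2, 0.4, 0.6, 1.2])
print(example)
```

With the published mean of 0.598, the same arithmetic gives a gap of -0.044 against the backtest and +0.029 against the market.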
The misses are right there
Six matches where the model was confidently wrong:
The biggest: Ecuador at 82% probability against Curacao. Dick Advocaat parked the bus and the match finished 0-0. The model's formation-blind Elo system has no representation for a 78-year-old Dutch coach running a defensive masterclass against a top-20 side.
Cape Verde 0-0 Spain. Saudi Arabia 1-1 Belgium. These sit at the top of the report card with big red F grades. If we only showed the 18 strong calls, you would have no way to calibrate whether the model is worth following. The misses are the calibration.
Why this matters
Any probability model can publish numbers. The question a reader should ask: "How do I know these are well-calibrated?"
The answer should not be "trust us." The answer should be a live scorecard, updated every day, with the specific methodology for how each match is graded, and with the embarrassing results displayed as prominently as the good ones.
That is what /accuracy/ and /report-card/ do.
What the grades mean
Each match gets a Brier score (the squared gap between the forecast probabilities and the actual outcome, summed over the possible results, so 0 is perfect and lower is better) and a letter grade:
| Grade | Brier range | Meaning |
|---|---|---|
| A+ | < 0.20 | Near-perfect confidence in the right outcome |
| A | 0.20 to 0.45 | Strong call, clearly above baseline |
| B | 0.45 to 0.65 | Respectable, around the historical mean |
| C | 0.65 to 0.85 | Below average: the model hedged too much or backed the wrong horse |
| D | 0.85 to 1.05 | Poor: the model was meaningfully wrong |
| F | > 1.05 | Miss. The model was confidently wrong. |
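The grading can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the site's scoring code: `brier_score` and `letter_grade` are hypothetical names, the score is assumed to be the multi-category form (squared error summed over home win, draw, and away win, which is what allows values above 1.0), and a score that lands exactly on a shared band edge is assumed to fall into the lower grade.

```python
def brier_score(probs, outcome):
    """Multi-category Brier score: sum of squared errors over all outcomes.

    probs   -- dict mapping outcome label to forecast probability
    outcome -- the label that actually occurred
    Ranges from 0 (perfect) to 2 (certain and wrong).
    """
    if outcome not in probs:
        raise ValueError(f"unknown outcome: {outcome!r}")
    if abs(sum(probs.values()) - 1.0) > 1e-6:
        raise ValueError("probabilities must sum to 1")
    return sum(
        (p - (1.0 if label == outcome else 0.0)) ** 2
        for label, p in probs.items()
    )


def letter_grade(score):
    """Map a Brier score to the bands in the table above.

    Assumption: shared edges (0.20, 0.45, 0.65, 0.85) go to the lower grade.
    The table states F as strictly above 1.05, so 1.05 itself is a D.
    """
    if score < 0.20:
        return "A+"
    if score < 0.45:
        return "A"
    if score < 0.65:
        return "B"
    if score < 0.85:
        return "C"
    if score <= 1.05:
        return "D"
    return "F"


# An 82% favourite held to a draw. The 12% / 6% split between the
# other two outcomes is made up for illustration.
forecast = {"home": 0.82, "draw": 0.12, "away": 0.06}
score = brier_score(forecast, "draw")
print(round(score, 4), letter_grade(score))
```

Under that made-up split the score comes out at about 1.45, well inside the F band, which shows how a confident favourite that fails to win is penalized far more heavily than a hedged forecast.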
The running mean, the grade distribution, and the day-by-day breakdown are all on the report card. The page updates automatically as matches are scored.
Go look
- /accuracy/ for the full accountability hub
- /report-card/ for the match-by-match grades
- /docs/calibration/ for the technical calibration methodology