The model never stops predicting. Here's the number we grade ourselves on

The 2026 World Cup starts tomorrow. Mexico and South Africa kick off at 19:00 UTC in Mexico City, sixteen years to the day after the same two teams opened the 2010 tournament in Johannesburg.

The row we will grade ourselves on for that match is already written: Mexico 64.1%, draw 24.8%, South Africa 11.1%. It entered our prediction log at 10:23 UTC this morning, alongside a row for each of the other 71 group-stage fixtures. The model will keep retraining every night of the tournament. The numbers on the fixture pages will keep moving. None of that can touch those rows.

This post explains the rule that freezes them, the score they get graded with, and the receipts that stop us from cheating.

A model that never stops predicting

The model retrains every night at 22:00 UTC, and through the tournament a live refit reruns every two hours. New results arrive, ratings shift, and every probability on the site updates. That is what a forecast should do: one that ignores new information is not principled, it is just stale.

It also creates an accountability problem. If "the model's prediction" changes by the hour, which version gets graded? A forecaster who keeps revising can always point, after full time, at whichever revision aged best. The only way out is to choose the graded number by a rule that is written down in advance and cannot be bent afterwards.

We use two rules, at two distances from kickoff.

Freeze one: the scoreboard row, 24 hours out

Every day, the model's current probabilities for every upcoming fixture are appended to a prediction log. Rows are added, never edited, never removed. When a match finishes, the calibration scoreboard grades the earliest row logged at least 24 hours before kickoff.

Both halves of that rule matter:

At least 24 hours before kickoff. A forecast made thirty minutes before kickoff, with the lineups published, is too easy to count as skill. The cutoff forces the model to commit while real uncertainty remains.
The earliest qualifying row, not the latest. A later row can never displace an earlier one, so a graded row is settled the moment it is written. Logging more often extends the record; it cannot game it.
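
As an illustration only, the selection rule can be sketched in a few lines of Python. The `LogRow` fields, the fixture identifier, and the two later rows are hypothetical; only the 10:23 UTC row and the 19:00 UTC kickoff come from this post.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

CUTOFF = timedelta(hours=24)  # minimum lead time before kickoff


@dataclass(frozen=True)
class LogRow:
    """One appended row of the prediction log (field names are hypothetical)."""
    fixture_id: str
    logged_at: datetime  # UTC time the row was appended
    p_home: float
    p_draw: float
    p_away: float


def graded_row(rows: list[LogRow], fixture_id: str, kickoff: datetime) -> Optional[LogRow]:
    """Return the earliest row logged at least 24 hours before kickoff, or None."""
    eligible = [
        r for r in rows
        if r.fixture_id == fixture_id and r.logged_at <= kickoff - CUTOFF
    ]
    return min(eligible, key=lambda r: r.logged_at, default=None)


kickoff = datetime(2026, 6, 11, 19, 0, tzinfo=timezone.utc)
log = [
    # The row quoted in this post, logged 32h37m before kickoff.
    LogRow("MEX-RSA", datetime(2026, 6, 10, 10, 23, tzinfo=timezone.utc), 0.641, 0.248, 0.111),
    # Invented later rows, for illustration only.
    LogRow("MEX-RSA", datetime(2026, 6, 10, 18, 0, tzinfo=timezone.utc), 0.650, 0.240, 0.110),
    LogRow("MEX-RSA", datetime(2026, 6, 11, 10, 0, tzinfo=timezone.utc), 0.700, 0.200, 0.100),
]

chosen = graded_row(log, "MEX-RSA", kickoff)
print(chosen.logged_at.isoformat())  # 2026-06-10T10:23:00+00:00
```

The second row also clears the 24-hour cutoff, but `min` on `logged_at` means it can never displace the first. The third row is inside the cutoff and is ignored entirely.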

Because this morning's snapshot covered all 72 group-stage fixtures at once, the scoreboard rows for the entire group stage are already fixed. The opener is one of the more confident ones. South Korea against Czech Republic, also in Group A, is the least confident: 36.7% South Korea, 26.6% draw, 36.7% Czech Republic. That is a coin flip the model declines to dress up as conviction, and it gets graded by the same rule as everything else.

Freeze two: the published number locks, 3 hours out

The fixture pages keep updating after the scoreboard row freezes, and that is by design: a reader checking a match page twelve hours before kickoff should see the model's current belief, not yesterday's. But three hours before kickoff, that stops too. Each fixture's published probabilities are copied into an immutable public record, and the page switches to the locked values with a "final prediction" badge. The locked entry is the last forecast the model published before the match, and it is what our post-match recaps grade.

Two freezes, two jobs. The early row keeps the calibration record honest, because nothing the model learns in the final day can flatter it. The late freeze pins down what the model actually said going in. By the time you read this, the opener page may well disagree with the frozen scoreboard row by a few percentage points. That is not a bug. The page tracks what the model believes now; the log records what it committed to then. The gap between them is the whole reason freeze rules exist.
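
The three-hour lock can be sketched the same way. This is a toy model under stated assumptions, not the site's code: `FixturePage`, `refresh`, and the probabilities passed in are all hypothetical, and the sketch assumes the lock freezes whatever was already on the page when the three-hour mark passes.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

LOCK_WINDOW = timedelta(hours=3)
Probs = tuple[float, float, float]  # (home win, draw, away win)


@dataclass
class FixturePage:
    """Hypothetical model of one fixture page and its locked record."""
    kickoff: datetime
    published: Probs
    locked: Optional[Probs] = None

    def refresh(self, now: datetime, live: Probs) -> Probs:
        """Return the probabilities the page should show at `now`."""
        if self.locked is not None:
            return self.locked  # already final, never changes again
        if now >= self.kickoff - LOCK_WINDOW:
            self.locked = self.published  # freeze what was on the page
            return self.locked
        self.published = live  # still outside the window: follow the model
        return self.published


kickoff = datetime(2026, 6, 11, 19, 0, tzinfo=timezone.utc)
page = FixturePage(kickoff, published=(0.641, 0.248, 0.111))

# Invented refits, for illustration only.
shown_5h = page.refresh(kickoff - timedelta(hours=5), (0.655, 0.240, 0.105))
shown_2h = page.refresh(kickoff - timedelta(hours=2), (0.700, 0.200, 0.100))
shown_1h = page.refresh(kickoff - timedelta(hours=1), (0.500, 0.300, 0.200))
```

Five hours out the page follows the refit. At two hours and one hour out it keeps showing the values locked at the three-hour mark, whatever the model has since come to believe.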

The number: the Brier score

For every match we publish three probabilities: home win, draw, away win. After full time, the outcome that happened becomes 1 and the other two become 0. The Brier score is the sum of the squared gaps between the probabilities and that outcome. Zero is perfect. Two is the worst possible. A blind guess of one third on each outcome scores 0.667 on every match, no football knowledge required.

The opener makes it concrete. If Mexico win, our frozen row scores about 0.20. If South Africa win, it scores about 1.26. The same 64.1% that earns a good score in the first case is what makes the second one expensive. Confidence is never free, and that is the property that makes the average across many matches meaningful.
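
For anyone who wants to check the arithmetic, here is a minimal Python sketch of the three-outcome Brier score described above. The function name and the outcome indexing (0 for home win, 1 for draw, 2 for away win) are choices made for this example.

```python
def brier(probs: tuple[float, float, float], outcome: int) -> float:
    """Sum of squared gaps between the forecast and the 0/1 outcome vector.

    `outcome` is 0 for a home win, 1 for a draw, 2 for an away win.
    """
    if abs(sum(probs) - 1.0) > 1e-6:
        raise ValueError("probabilities must sum to 1")
    return sum(
        (p - (1.0 if i == outcome else 0.0)) ** 2
        for i, p in enumerate(probs)
    )


opener = (0.641, 0.248, 0.111)  # Mexico, draw, South Africa
blind = (1 / 3, 1 / 3, 1 / 3)

print(f"{brier(opener, 0):.2f}")  # Mexico win: 0.20
print(f"{brier(opener, 1):.2f}")  # draw: 0.99
print(f"{brier(opener, 2):.2f}")  # South Africa win: 1.26
print(f"{brier(blind, 0):.3f}")   # blind guess, any result: 0.667
```

The draw case, about 0.99, sits between the two: the model gave the draw 24.8%, so it costs more than the favoured result and less than the outsider.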

For context: on a leakage-free backtest across 24 major tournaments, from the 2014 World Cup through Euro 2024 and Copa América 2024, the model averaged a Brier score of 0.572 over 987 matches it had never seen, against the 0.667 blind-guess baseline. International tournament football is low-scoring and draw-prone, so nobody gets near zero. The claim is not "the model is right". The claim is: measurably better than guessing, by a stated amount (about 0.095 per match), on matches it could not have memorised. The 2026 number will land where it lands, and it will land in public.

The public JSON record

Three machine-readable files on this site carry the record. All three are live right now, and all three are nearly empty:

/frozen_predictions.json holds the locked final predictions. Today it contains no fixtures; the first entry lands three hours before the opener.
/in_tournament_metrics.json holds the running Brier score across played matches, a rolling ten-match window, and a per-matchday table. Today it reports zero matches scored and a note that the first is expected on June 11.
/calibration_segments.json holds the reliability bins (of the matches where we said about 70%, how many happened?) and splits by stage. Every metric in it is currently null.
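
The quantities in those files are simple to compute. The sketch below is an assumption about their general shape, not the files' actual schema: the function names, the ten equal-width bins, and the handling of short windows are all illustrative choices. Returning `None` on empty input mirrors the null metrics the files carry today.

```python
from collections import defaultdict
from typing import Iterable, Optional


def running_brier(scores: list[float]) -> Optional[float]:
    """Mean Brier score over every played match, or None before the first."""
    return sum(scores) / len(scores) if scores else None


def rolling_brier(scores: list[float], window: int = 10) -> Optional[float]:
    """Mean over the most recent `window` matches (fewer if not yet available)."""
    recent = scores[-window:]
    return sum(recent) / len(recent) if recent else None


def reliability_bins(
    forecasts: Iterable[tuple[float, bool]], n_bins: int = 10
) -> dict[int, dict[str, float]]:
    """Group (probability, happened) pairs into equal-width bins.

    Bin 7 of 10 covers forecasts from 70% up to 80%; `hit_rate` is the
    share of those forecasts whose outcome actually happened.
    """
    counts: dict[int, list[int]] = defaultdict(lambda: [0, 0])
    for p, happened in forecasts:
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the top bin
        counts[idx][0] += 1
        counts[idx][1] += int(happened)
    return {
        idx: {"n": n, "hit_rate": hits / n}
        for idx, (n, hits) in sorted(counts.items())
    }


# Invented scores and forecasts, for illustration only.
scores = [0.5] * 10 + [1.0] * 10
bins = reliability_bins([(0.72, True), (0.68, False), (0.71, True), (0.11, False)])

print(running_brier(scores))  # 0.75
print(rolling_brier(scores))  # 1.0
print(bins[7])                # {'n': 2, 'hit_rate': 1.0}
```

A well-calibrated model would show a hit rate near 0.7 to 0.8 in bin 7 once enough matches have been played; with two forecasts in the bin, as here, the number says nothing yet.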

The empty state is deliberate. The files, their schema, and their URLs exist before a single ball is kicked, published at the one moment they cannot contain anything flattering. If you would rather verify than trust, fetch them today, keep your copies, and diff them against what they say after matchday 1.

Each file also explains itself from the inside. A provenance note in the payload states what the numbers are, the rule that chose the graded forecast, how the Brier score is computed, and where the methodology and the archive receipts live. A copy you save today still carries its own context wherever it ends up.

The human-readable version lives at /accuracy/, with the full scoreboard at /docs/calibration/.

Receipts we cannot edit

An append-only log that we ourselves maintain still has a hole in it: you would have to take our word that we never rewrote history. So there is a second layer we do not control. Forecast pages are captured by the Internet Archive's Wayback Machine while their matches are still strictly in the future. A daily job walks the upcoming fixtures and archives any forecast page not yet captured; a page that is already archived is never re-captured, so the earliest timestamp is the one that stands. Twelve captures exist so far, the first taken on June 3, and each archived fixture page shows an "archived before the result" badge linking straight to its capture.
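
The capture-once logic of that daily job can be sketched as follows. This is an outline under assumptions, not the job itself: the lookup and capture steps are passed in as placeholder callables, the URLs and kickoff times are invented, and an in-memory set stands in for the archive so the example runs without network access.

```python
from datetime import datetime, timezone
from typing import Callable, Iterable


def archive_new_pages(
    fixtures: Iterable[tuple[str, datetime]],
    now: datetime,
    already_archived: Callable[[str], bool],
    capture: Callable[[str], None],
) -> list[str]:
    """Capture each upcoming forecast page at most once; return what was captured."""
    captured = []
    for url, kickoff in fixtures:
        if kickoff <= now:
            continue  # only matches still strictly in the future
        if already_archived(url):
            continue  # never re-capture: the earliest timestamp stands
        capture(url)
        captured.append(url)
    return captured


# In-memory stand-in for the archive; one page is already captured.
archive: set[str] = {"https://example.com/fixture-1/"}
now = datetime(2026, 6, 10, 12, 0, tzinfo=timezone.utc)
fixtures = [
    ("https://example.com/fixture-1/", datetime(2026, 6, 11, 19, 0, tzinfo=timezone.utc)),
    ("https://example.com/fixture-2/", datetime(2026, 6, 12, 2, 0, tzinfo=timezone.utc)),
    ("https://example.com/fixture-0/", datetime(2026, 6, 9, 19, 0, tzinfo=timezone.utc)),
]

first_run = archive_new_pages(fixtures, now, archive.__contains__, archive.add)
second_run = archive_new_pages(fixtures, now, archive.__contains__, archive.add)
```

The first run captures only the one page that is both upcoming and not yet archived. The second run captures nothing, which is the property that keeps the earliest timestamp as the only one.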

The Internet Archive is a public library. We cannot edit its timestamps, and that is exactly what makes the badge worth showing. If our published probabilities ever disagreed with what the archive recorded before kickoff, anyone could prove it in two clicks.

What to watch

The group stage runs through June 27: 72 matches, 72 graded rows. The scoreboard will carry the running Brier score, the rolling window, and the per-matchday table, updated daily, good or bad. Over 72 matches there will almost certainly be a stretch where the model looks clueless and a stretch where it looks clairvoyant. Both will sit on the same page, scored by the same frozen rows.

The full forecast for all 48 teams is at /forecast/. The graded record starts tomorrow at /accuracy/: frozen at least 24 hours before each kickoff, locked again at three hours, scored in public, and archived by a library we do not control.

All numbers in this post are model outputs as of the June 10 prediction-log snapshot. They are for research and educational purposes only: not betting advice, not financial advice, not recommendations to gamble. The model can be wrong. Methodology: /docs/methodology/. Full Terms of Use.