What 54 World Cup matches taught the model

The group stage is almost over. Fifty-four of 72 matches have been played; 18 remain. Every probability the model published before kickoff has been scored against the result. This is the full accounting: what the model got right, what it got wrong, and what we tried to fix along the way.

The headline numbers

Metric	Value	Context
Overall Brier score	0.541	Lower is better. Range 0 to 2.
Pre-tournament backtest (987 matches, 2014-2024)	0.572	The model is outperforming its own history
Uniform 1/3 guess (no-skill baseline)	0.667	Any calibrated model should beat this
Log loss	0.884	Secondary metric, penalises confident misses more
Expected calibration error	20.9%	Much higher than the 5.6% from the backtest

The Brier score, 0.541, is encouraging. The model is performing better than its own 2014-2024 backtest predicted, and well below the no-skill baseline. At the same time, the expected calibration error (ECE) of 20.9% is far worse than the 5.6% the walk-forward backtest produced. Those two facts are not contradictory: the Brier score rewards getting the direction right even when the magnitudes are off, while ECE specifically measures whether the stated probability matches the observed frequency. The model is picking the right outcomes more often than expected, but the probabilities it attaches to them are poorly sized in certain bins.
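For readers who want to check the scale of these numbers, here is a minimal sketch of the two scoring rules for a three-outcome match. The function names and the (home, draw, away) ordering are illustrative choices, not taken from the model's code, and the "confident" probabilities are a hypothetical split rather than a published prediction.

```python
import math

def brier(probs, outcome):
    """Multi-class Brier score for one match (range 0 to 2).

    probs   -- (p_home, p_draw, p_away), summing to 1
    outcome -- index of the observed result: 0 home, 1 draw, 2 away
    """
    return sum((p - (1.0 if i == outcome else 0.0)) ** 2
               for i, p in enumerate(probs))

def log_loss(probs, outcome, eps=1e-12):
    """Negative log of the probability given to the observed result."""
    return -math.log(max(probs[outcome], eps))

uniform = (1 / 3, 1 / 3, 1 / 3)
print(round(brier(uniform, 1), 3))      # 0.667, the no-skill baseline
print(round(log_loss(uniform, 1), 3))   # 1.099

# Hypothetical split for a heavy favourite whose match is then drawn
confident = (0.88, 0.10, 0.02)
print(round(brier(confident, 1), 3))    # 1.585
```

A single confident miss of this kind costs more than two uniform guesses, which is why a handful of drawn matches can move a 54-match average.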

A Brier of 0.541 across 54 matches is a respectable group-stage performance. It is not excellent. The model sits somewhere between "meaningfully better than guessing" and "close to the best available public forecasts." The honest read: the group stage played out roughly as the model's historical record suggested it would, with one systematic exception that dominated the early matches.

Two phases of the group stage

The overall Brier hides a story. The tournament had two distinct phases for the model, and the boundary falls between matchday 1 and matchday 2.

Phase	Matches	Mean Brier	Draws
Matchday 1 (June 11-17)	24	0.660	9 (37.5%)
Matchday 2 (June 18-23)	24	0.454	5 (20.8%)
Matchday 3 (June 24, partial)	6	0.410	0 (0%)

The model's matchday 1 Brier of 0.660 is barely better than the 0.667 no-skill baseline. A reader scoring the model after the first 24 matches would have been right to question whether it added any value at all. By matchday 2, the Brier had dropped 31% to 0.454, comfortably beating the backtest. The six matchday 3 results so far are better still.
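As a quick consistency check, the per-phase figures can be recombined into the headline number. This sketch uses only the match counts and mean Brier scores from the phase table; the variable names are illustrative.

```python
# (matches, mean Brier) per phase, copied from the phase table
phases = {
    "matchday 1": (24, 0.660),
    "matchday 2": (24, 0.454),
    "matchday 3 (partial)": (6, 0.410),
}

total_matches = sum(n for n, _ in phases.values())
overall = sum(n * b for n, b in phases.values()) / total_matches
drop = 1 - phases["matchday 2"][1] / phases["matchday 1"][1]

print(total_matches)        # 54
print(round(overall, 3))    # 0.541
print(round(drop * 100))    # 31
```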

Per-day Brier scores tell the same story. The model's best day was June 16 (Brier 0.197): France 3-1 Senegal, Norway 4-1 Iraq, Argentina 3-0 Algeria, Austria 3-1 Jordan, all favourites winning decisively. Its worst day was June 15 (Brier 1.148), when all four matches were drawn: Spain 0-0 Cape Verde, Saudi Arabia 1-1 Uruguay, Belgium 1-1 Egypt, and Iran 2-2 New Zealand. The model had the draw probability below 24% in every one of them.

June 15 was the day the model's draw blind spot cost it most. And the draw blind spot is the dominant story of the group stage.

The draw blind spot

Across 54 matches, the model predicted an average draw probability of 22.8%. The observed draw rate was 25.9% (14 of 54). In aggregate, those numbers are close enough to pass a rough calibration check. But the aggregate conceals a severe timing problem.

On matchday 1, 37.5% of matches drew. The model predicted an average draw probability of about 23% for those same matches. On matchday 2, the draw rate fell to 20.8%, closer to the model's predictions. On matchday 3, no match has yet drawn.

The matchday 1 draw rate, 37.5%, is far above the historical World Cup group-stage average. Across all 54 matches played, nine predictions scored a Brier above 1.0, well beyond the 0.667 a uniform guess earns. Seven of the nine are draws; two are upsets where a lower-rated team won:

Match	Model's expected outcome	Brier
Spain 0-0 Cape Verde	Spain 87.8%	1.572
Curaçao 0-0 Ecuador	Ecuador 82.1%	1.397
Qatar 1-1 Switzerland	Switzerland 79.7%	1.353
England 0-0 Ghana	England 79.2%	1.322
Portugal 1-1 DR Congo	Portugal 75.1%	1.220
Cape Verde 2-2 Uruguay	Uruguay 72.8%	1.151
South Africa 1-0 South Korea	South Korea 57.5%	1.093
Saudi Arabia 1-1 Uruguay	Uruguay 69.3%	1.073
Ivory Coast 1-0 Ecuador	Ecuador 53.4%	1.058

Every match on this list either drew when the model expected a decisive result, or produced an upset where the model's favoured side lost. In each case the model would have scored better by assigning equal probability to all three outcomes. All nine fell in matchdays 1 and 2 (no matchday 3 result so far has gone against the model's favourite), and together they drove the early-tournament Brier toward the no-skill baseline.

The draw problem has a structural cause. The model's calibrator was fit on 24 months of major-tournament matches from 2014 to 2024. Those tournaments were played on a mix of home-nation venues and neutral sites, but the underlying data includes genuine home and away advantages. The 2026 World Cup is played entirely on neutral ground in the United States, Canada, and Mexico. Every match is officially neutral. This collapses the distinction between home and away that the model's Dixon-Coles layer depends on: the home advantage parameter, which pushes probability mass away from draws and toward the designated "home" side, is miscalibrated when no real home advantage exists.

The evidence is stark. Across all 54 matches, the model predicted away wins at an average of 30.2%; the actual away-win rate was 13.0% (7 of 54). Home-listed teams won 61.1% of matches (33 of 54) against a predicted 47.0%. The surplus sits on the away side: the model over-predicted away wins by about 17 points while under-predicting home-listed wins by about 14 and draws by about 3, so the home/away designation is carrying a venue adjustment that the neutral-site results do not support.

We diagnosed this during the tournament and investigated a simple multiplicative draw-boost calibration. The optimal boost was +9 percentage points on draw probability, which would have reduced the 40-match Brier from 0.598 to 0.584. But the adjustment failed leave-one-out validation (Brier worsened from 0.598 to 0.615 under LOO) because the draw miscalibration was not uniform across matches: draws were under-predicted in matches with strong favourites but over-predicted in balanced matches. A flat correction overfits to the former and worsens the latter. The fix needs a non-linear or binned calibration, which in turn needs more data than 40 matches can provide.
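The gap between an in-sample gain and a leave-one-out loss can be reproduced in outline. The sketch below is not the project's calibration code: it assumes the boost is an additive shift in percentage points with the remainder rescaled proportionally (one reading of the description above), and it runs on toy predictions invented for illustration.

```python
def brier(probs, outcome):
    return sum((p - (i == outcome)) ** 2 for i, p in enumerate(probs))

def boost_draw(probs, delta):
    """Add `delta` to the draw probability; rescale home/away to keep the sum at 1."""
    home, draw, away = probs
    new_draw = min(max(draw + delta, 0.0), 1.0)
    rest = home + away
    scale = (1.0 - new_draw) / rest if rest > 0 else 0.0
    return (home * scale, new_draw, away * scale)

def mean_brier(preds, outcomes, delta):
    scores = [brier(boost_draw(p, delta), o) for p, o in zip(preds, outcomes)]
    return sum(scores) / len(scores)

def best_delta(preds, outcomes, grid):
    """In-sample optimum: the grid value with the lowest mean Brier."""
    return min(grid, key=lambda d: mean_brier(preds, outcomes, d))

def loo_brier(preds, outcomes, grid):
    """Leave-one-out: tune delta on n-1 matches, score the held-out match."""
    scores = []
    for i in range(len(preds)):
        train_p = preds[:i] + preds[i + 1:]
        train_o = outcomes[:i] + outcomes[i + 1:]
        d = best_delta(train_p, train_o, grid)
        scores.append(brier(boost_draw(preds[i], d), outcomes[i]))
    return sum(scores) / len(scores)

# Toy data only (not the model's predictions): (home, draw, away), outcome index
preds = [(0.80, 0.12, 0.08), (0.75, 0.15, 0.10), (0.45, 0.28, 0.27),
         (0.40, 0.30, 0.30), (0.60, 0.22, 0.18), (0.85, 0.10, 0.05)]
outcomes = [1, 0, 0, 2, 1, 0]
grid = [i / 100 for i in range(0, 21)]   # 0 to +20 points

baseline = mean_brier(preds, outcomes, 0.0)
in_sample = mean_brier(preds, outcomes, best_delta(preds, outcomes, grid))
held_out = loo_brier(preds, outcomes, grid)
print(baseline, in_sample, held_out)
```

Because the grid includes zero, the in-sample score can never be worse than the baseline; only the held-out score can reveal that a tuned boost does not generalise.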

We did not ship the draw adjustment. The model's published probabilities for matchday 3 used the same calibrator as matchday 1. This is a deliberate choice: shipping a correction that fails cross-validation would improve the in-sample numbers while degrading robustness. The draw problem is acknowledged, measured, and documented. It is not yet solved.

Heavy favourites and overconfidence

The calibration segments tell a related story. Looking at the reliability bins:

Predicted probability	Mean prediction	Observed frequency	n
70-80%	75.3%	55.6%	9
80-90%	85.1%	33.3%	3
90-100%	90.4%	100.0%	2

The 90%+ bin is perfect but tiny (Germany 90.0% vs Curaçao, Brazil 90.8% vs Haiti, both won). The 70-80% bin tells the real story: the model predicted outcomes at 75% confidence nine times, and those outcomes happened 56% of the time. That is a 20-point gap between prediction and reality.

The three events in the 80-90% bin are worse: a 33% hit rate on 85% confidence predictions. Those three are Spain vs Cape Verde (87.8%, drew), Ecuador vs Curaçao (82.1%, drew), and Spain vs Saudi Arabia (85.4%, won). Two of three were draws against minnows.

This is the same draw problem, seen through a different lens. When the model assigns 85% to a heavy favourite, it assigns roughly 5-10% to the draw. The actual draw rate in those matches has been far higher than 5-10%. The model's favourite-win probabilities are not wrong in direction (these teams are stronger), but they overstate the margin by compressing the draw probability.

In the more moderate favourite bands, the model is better calibrated. The 50-60% bin (13 predictions, 54% hit rate) and the 40-50% bin (10 predictions, 40% hit rate) are both close to their predicted means. The model's discrimination in balanced matches is sound. The calibration problem is concentrated at the extremes, where the neutral-venue effect and the draw compression interact.
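The per-bin gaps can be recomputed from the reliability table earlier in this section. This is a minimal sketch using only the three rows published there; the weighted figure it prints covers those three bins alone, so it is not the 20.9% headline ECE, which is computed over all bins.

```python
# (bin label, mean prediction %, observed frequency %, n) from the reliability table
bins = [
    ("70-80%", 75.3, 55.6, 9),
    ("80-90%", 85.1, 33.3, 3),
    ("90-100%", 90.4, 100.0, 2),
]

def gaps(rows):
    """Absolute gap between stated and observed frequency, per bin."""
    return {label: round(abs(pred - obs), 1) for label, pred, obs, _ in rows}

def weighted_gap(rows):
    """n-weighted mean gap: the ECE formula, restricted to the rows supplied."""
    total = sum(n for _, _, _, n in rows)
    return sum(n * abs(pred - obs) for _, pred, obs, n in rows) / total

print(gaps(bins))                     # {'70-80%': 19.7, '80-90%': 51.8, '90-100%': 9.6}
print(round(weighted_gap(bins), 1))   # 25.1
```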

The improvement: why matchday 2 was different

The model improved sharply from matchday 1 to matchday 2 (0.660 to 0.454 Brier, a 31% reduction). Three factors explain the shift.

First, the draw rate dropped. Matchday 1 produced draws at 37.5%. Matchday 2 produced them at 20.8%. The model's draw blind spot cost it less when there were fewer draws to miss. This is partly mechanical (fewer opportunities for the model's worst failure mode) and partly structural (matchday 2 teams know what they need from the table and play more decisively).

Second, clear favourites delivered. Matchday 2 included France 3-0 Iraq, Spain 4-0 Saudi Arabia, Portugal 5-0 Uzbekistan, Netherlands 5-1 Sweden, Brazil 3-0 Haiti, and Canada 6-0 Qatar. These are matches where the model assigned high win probabilities and the results confirmed them. On matchday 1, Spain and Portugal had drawn instead of winning, as had Belgium, producing some of the worst Brier scores of the group stage.

Third, the model's Elo ratings were updated as results came in. Through two matches, the USA has gained 49.8 Elo points (the largest single-team gain), reflecting wins over Paraguay (4-1) and Australia (2-0). Turkey has lost 50.5 points (the largest drop). These shifts were not fed back into the published pre-tournament probabilities, which were frozen, but they informed the model's internal ratings and the new features added during the tournament: player composites blended with video-analysis ratings, referee tendency adjustments, and travel/rest factors.

The matchday 3 partial data (6 matches, Brier 0.410) continues the trend, with every result going to the team the model favoured.

Five pre-tournament contrarian calls, scored

Before the tournament we published five places the model disagreed with the consensus. Here is how each call has held up through 54 matches.

1. Ecuador rated above Germany. The model's Elo had Ecuador at 1933, Germany at 1923. Fourteen days into the tournament, Ecuador have a loss and a draw from Group E (0-1 against Ivory Coast, 0-0 with Curaçao). Germany have two wins (7-1 vs Curaçao, 2-1 vs Ivory Coast). The two play each other today. The model's read on schedule strength was not unreasonable, but the on-pitch results have not supported the claim. Verdict: not validated.

2. Raphinha as the #1 anytime scorer. Brazil have won two and drawn one, scoring seven goals in the process. Raphinha's tournament-scorer ranking depends on individual goal tallies we will assess after the group stage is complete. The tournament path depth the model predicted (5.6 matches for Brazil) is tracking plausibly with Brazil looking strong in Group C. Verdict: in progress.

3. Iran at 81% to advance. After two draws (2-2 New Zealand, 0-0 Belgium), Iran sit on 2 points in a group where Egypt lead with 4 points and Belgium also have 2. Iran's qualification depends on their matchday 3 result against Egypt. The 81% looked comfortable pre-tournament; the group has been more competitive than expected, with Belgium's golden generation drawing both matches. Verdict: tight, not yet resolved.

4. Spain and Argentina pulling away from FIFA-implied rates. Argentina have won both matches (3-0 Algeria, 2-0 Austria) and Messi has reached 18 career World Cup goals, the most in the history of the tournament. Spain stumbled with the 0-0 Cape Verde draw but responded with a 4-0 win over Saudi Arabia. Both teams are through or close to it. The model's confidence in them looks justified by what has followed. Verdict: validated.

5. The USA as the underdog in every home group match. The model gave the USA a 32.2% win probability against Paraguay (they won 4-1), 31.0% against Australia (they won 2-0), and 30.8% against Turkey (tonight). Two wins from two, both convincing, as the designated underdog. The model correctly identified that the USA's Elo did not rate them as favourites, and the USA proved that the Elo was wrong. Verdict: the model's probabilities were directionally wrong, the USA are better than their rating suggested, and the pre-tournament call to flag this gap was useful.

The scorecard: one call validated, one not validated, one directionally wrong but usefully flagged, and two still in play. That is a realistic hit rate for contrarian positions.

What we built during the tournament

The model's pre-tournament probabilities were frozen, but the infrastructure around them was not. During the group stage we built and shipped several new components, each through the same gate-then-ship discipline used before the tournament.

In-tournament Elo updates. Standard Elo updates (K=40, neutral venue) on all played fixtures, producing a running Elo that reflects tournament performance. The USA was the biggest gainer (+49.8 points after two decisive wins). Turkey was the biggest loser (-50.5 after losing to Australia and Paraguay).
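A minimal sketch of a standard Elo update with K=40 and no home-advantage offset follows. Whether the production update also applies a goal-difference multiplier is not stated here, so this version omits one; the function names are illustrative.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expectation for team A, with no home-advantage offset."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update(rating_a, rating_b, score_a, k=40.0):
    """Return both teams' new ratings. score_a: 1 win, 0.5 draw, 0 loss."""
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta

# Two equally rated sides: the winner gains K/2 = 20 points
print(update(1800.0, 1800.0, 1.0))   # (1820.0, 1780.0)
```

The update is zero-sum, so one team's gain is always the other's loss, and an underdog that wins gains more than a favourite that wins.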

Player composite updates from video data. A pipeline that watches every match's highlight reel using Gemini and scores every visible player on a 1-10 scale across seven tactical dimensions (chance creation, pressing defence, shot quality, among others). After each matchday, these video ratings were blended with pre-tournament player composites using a match-count-scaled alpha (10% per match, max 40%). This produced an updated player-quality surface that reflected actual tournament performance rather than pre-tournament projections.
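The blending weight described above can be sketched as follows. The 10% per match and 40% cap come from the description; the linear blend itself and the function names are assumptions made for illustration.

```python
def blend_alpha(matches_played, per_match=0.10, cap=0.40):
    """Weight on the video rating: 10% per match played, capped at 40%."""
    return min(per_match * matches_played, cap)

def blended_rating(pre_tournament, video, matches_played):
    """Assumed linear blend of the pre-tournament composite and the video rating."""
    alpha = blend_alpha(matches_played)
    return (1.0 - alpha) * pre_tournament + alpha * video

# Hypothetical player: composite 6.0 before the tournament, video rating 8.0
for n in range(0, 6):
    print(n, round(blend_alpha(n), 2), round(blended_rating(6.0, 8.0, n), 2))
```

Under this scheme the video rating never outweighs the pre-tournament composite, even for a team that plays all the way to the final.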

Referee and environmental adjustments. Bayesian-shrunk lambda adjustments for referee tendencies (penalty-giving rate, card strictness) and asymmetric travel/rest factors (long-haul travel, timezone shift, rest-day deficit), plus expanded weather factors (wind, precipitation, humidity interaction with heat). These fed into the predict-time goal rate as multiplicative adjustments on the Dixon-Coles lambda.

Matchday 3 forward test. Before any matchday 3 results, we froze seven prediction variants for all 24 remaining fixtures: baseline, draw-boosted (+8%), video-adjusted, form-adjusted, and three combinations. Each variant includes 200 placebo trials (shuffled deltas across teams) to establish whether any signal beats random at the 90th percentile. After matchday 3 results are complete, we can score all seven variants and determine which (if any) adjustment mechanisms carry genuine signal.

What didn't survive testing

Not all of these experiments produced signal. Three results are worth reporting.

Video data: no signal on round one, a flicker on round two. We ran two forward tests. In the first, matchday 1 video ratings were used to adjust matchday 2 predictions via log-odds shifts (scale 0.04 per unit delta). The placebo gate placed the real video-adjusted predictions at the 43rd percentile of 200 random shuffles. That is below the 90th-percentile threshold and below even the 50th. The video signal was too small (average 0.004 shift per match) to be distinguishable from noise across 24 matches.
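The placebo gate can be sketched in a few lines. The scale of 0.04 per unit delta and the 200 shuffles come from the description above; how the shift is applied to a three-outcome forecast is an assumption here (home and away weights are moved on the log scale and renormalised), and the teams, deltas, and fixtures are toy values.

```python
import math
import random

def brier(probs, outcome):
    return sum((p - (i == outcome)) ** 2 for i, p in enumerate(probs))

def shift(probs, delta_home, delta_away, scale=0.04):
    """Move the home and away log-weights by scale * delta, then renormalise."""
    home, draw, away = probs
    w = (home * math.exp(scale * delta_home), draw,
         away * math.exp(scale * delta_away))
    total = sum(w)
    return tuple(x / total for x in w)

def mean_brier(fixtures, deltas):
    scores = [brier(shift(p, deltas[h], deltas[a]), o) for h, a, p, o in fixtures]
    return sum(scores) / len(scores)

def placebo_percentile(fixtures, deltas, trials=200, seed=0):
    """Share of shuffled-delta runs that the real deltas beat (lower Brier)."""
    rng = random.Random(seed)
    real = mean_brier(fixtures, deltas)
    teams, values = list(deltas), list(deltas.values())
    beaten = 0
    for _ in range(trials):
        rng.shuffle(values)   # reassign deltas to teams at random
        if real < mean_brier(fixtures, dict(zip(teams, values))):
            beaten += 1
    return 100.0 * beaten / trials

# Toy fixtures: (home team, away team, (home, draw, away), outcome index)
fixtures = [
    ("A", "B", (0.55, 0.25, 0.20), 0),
    ("C", "D", (0.40, 0.30, 0.30), 2),
    ("A", "D", (0.60, 0.22, 0.18), 1),
    ("B", "C", (0.35, 0.30, 0.35), 0),
]
deltas = {"A": 0.8, "B": -0.3, "C": 0.1, "D": -1.4}
pct = placebo_percentile(fixtures, deltas)
print(f"real adjustment beats {pct:.0f}% of shuffles; gate passes: {pct >= 90}")
```

The point of the shuffle is that any adjustment of the same size will move the Brier score a little; the gate asks whether assigning the deltas to the right teams does better than assigning them at random.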

The second forward test, matchday 2 to matchday 3, told a slightly different story. With two matches of video data per team instead of one, and clearer quality separation between teams, the video-only adjustment was the best-performing variant on the six scored matchday 3 fixtures:

Variant	Mean Brier	vs Baseline (bp; 1 bp = 0.001 Brier)
video_only	0.409	-2.7bp
form_only	0.410	-2.5bp
baseline	0.412	--
draw_boost	0.486	+74bp

The video adjustment's largest single contribution came from Bosnia 3-1 Qatar. Qatar's video delta was -1.435, the most negative in the tournament, meaning their highlight-reel performance across two matches was far below what their pre-tournament probability implied. The adjustment correctly redistributed probability away from Qatar, improving that match's Brier by 2.4 basis points. Its largest miss was Canada 1-2 Switzerland: video rated Canada positively and Switzerland negatively, but Switzerland won.

The draw boost, which looked promising on the first 40 matches (+9pp optimal), was the worst variant on matchday 3 by a wide margin. Matchday 3 produced zero draws from six matches. A correction calibrated on the high-draw opening round hurt badly when the draw rate collapsed.

An improvement of 2.7 basis points across six matches is within noise. Two forward tests, one negative and one weakly positive, do not establish that video analysis improves match prediction. The style-matchup regression on video grading dimensions (R²=0.80 on goal differential from 26 fixtures) suggests the dimensional signal may be stronger than the flat rating-delta approach, but that R² is in-sample from a small corpus and almost certainly overfitting. We will continue running forward tests through the knockout stage. If the video adjustment survives a cumulative placebo gate across 30+ matches, it earns a place in the production model. Until then, it remains experimental.

Draw recalibration: fails cross-validation. The optimal multiplicative draw boost (+9 percentage points) improved in-sample Brier from 0.598 to 0.584 on 40 matches. Under leave-one-out validation, it worsened Brier from 0.598 to 0.615. The draw miscalibration is non-monotonic across the probability range: draws are under-predicted when the model gives the draw a low probability (under 20%) but over-predicted when the model gives the draw a moderate probability (20-30%). A flat multiplicative correction cannot fix a non-monotonic pattern. It overfits to the matches where the correction helps and degrades the matches where the baseline was already close.

Hidden motivations: confounded with matchday. We tested whether must-win and dead-rubber incentive states predict match outcomes beyond what Elo expects, using 830 group-stage matches across 23 major tournaments from 1990 to 2024. The apparent motivation signal disappeared entirely once we controlled for a single confounder: matchday. Must-win matches cluster on matchday 3, and matchday 3 is independently harder to predict. The motivation label adds zero information beyond the matchday indicator. Dead-rubber matches were actually easier to predict than standard competition (Brier delta -0.136), the opposite of the textbook hypothesis.

What the calibration reveals about the model's architecture

The systematic patterns in the data point to specific architectural limitations, some of which the project had already identified and acted on before the tournament started.

The draw under-prediction traces to the Dixon-Coles home-advantage parameter. The model was fit on a global corpus of international matches that includes genuine home and away effects (higher goal rates at home, lower away). When every match in the tournament is played on neutral ground, that asymmetry does not apply, but the model's parameters still encode it. The effect is to push probability mass toward the team listed as "home" and away from the draw. A model designed for neutral-only venues would likely have a smaller home-advantage parameter and higher baseline draw rates.

The away-side miscalibration traces to the same parameter. In the Dixon-Coles model, the away team gets a lower expected goal rate by exactly the home-advantage parameter. On neutral ground, neither team should receive that adjustment, but the model applies it anyway based on the fixture's home/away designation. This pushes too much probability toward the home-listed team and away from the draw.
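To see how a home-advantage term moves probability mass, here is a simplified sketch using independent Poisson scores. It omits the Dixon-Coles low-score correction, applies the advantage as a multiplier on the home goal rate (an assumption about parameterisation), and uses hypothetical values for both the base rate and the multiplier.

```python
import math

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)

def outcome_probs(lam_home, lam_away, max_goals=10):
    """(home, draw, away) from independent Poisson scores, truncated at max_goals."""
    home = draw = away = 0.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = poisson_pmf(h, lam_home) * poisson_pmf(a, lam_away)
            if h > a:
                home += p
            elif h == a:
                draw += p
            else:
                away += p
    total = home + draw + away   # renormalise the truncated grid
    return home / total, draw / total, away / total

base = 1.3    # hypothetical goal rate for two evenly matched sides
gamma = 1.3   # hypothetical home-advantage multiplier

neutral = outcome_probs(base, base)
with_home = outcome_probs(base * gamma, base)
print("neutral:  ", [round(p, 3) for p in neutral])
print("with home:", [round(p, 3) for p in with_home])
```

With identical teams, switching the multiplier on raises the home-listed side's win probability and lowers both the draw and the away win, which is the direction of the effect described above.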

The heavy-favourite overconfidence is harder to diagnose. One candidate, extremization, was already eliminated. A post-processing tuning sweep before the tournament found that the extremization step (which sharpens probabilities by raising them to a power d=1.15) was monotonically worsening Brier scores. Removing it gained 26.8 basis points. Production now runs at d=1.0 (no extremization). The same sweep disabled the goalkeeper defence offset (which added 0.8 basis points of noise) and the player-composite differential offset (which failed a placebo permutation test at p=0.46). Three features that had initially passed the walk-forward gate were retired when subjected to stricter validation. The model entered the tournament leaner than the 19-variants post described.
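Extremization as described here, raising each probability to a power and renormalising, can be sketched in a few lines. The renormalisation step and the example probabilities are assumptions for illustration; only the exponent values come from the text.

```python
def extremize(probs, d):
    """Raise each probability to the power d, then renormalise to sum to 1."""
    powered = [p ** d for p in probs]
    total = sum(powered)
    return [p / total for p in powered]

example = [0.60, 0.25, 0.15]   # hypothetical (home, draw, away)
print([round(p, 3) for p in extremize(example, 1.0)])    # unchanged at d=1.0
print([round(p, 3) for p in extremize(example, 1.15)])   # favourite pushed higher
```

Any d above 1 moves mass from the less likely outcomes to the most likely one, so it compounds exactly the heavy-favourite overconfidence discussed in this section.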

With those three features removed, the heavy-favourite overconfidence must come from the underlying components themselves: the Dixon-Coles and Hierarchical Poisson goal-rate estimates, amplified by the home-advantage parameter, producing win probabilities that are too extreme for the neutral-venue regime. The per-tier isotonic calibrator, fit on a 24-month tournament window, did not correct this because the calibration corpus includes tournaments played on home soil (Euro 2024, Copa America 2024) where genuine home advantages exist and strong favourites do win at the rates the model predicts.

These are not bugs. They are the correct behaviour of a model trained on a corpus where home advantage is real and matches are not all neutral. The World Cup is a specific regime that the model's training data includes but does not dominate. Fixing these issues without regressing on the broader corpus is the challenge for the knockout-stage predictions and for any future neutral-venue tournament.

What we know heading into the knockouts

The group stage has provided specific, actionable feedback.

The model's discrimination is sound. When the model rates a match as close to even (40-60% favourite), its hit rate is close to the predicted frequency. The Brier on balanced matches is comfortably below baseline. The model can tell strong teams from weak ones and rates close matches appropriately.

The model's calibration at the extremes needs work. Predictions in the 70-90% win-probability range have been markedly overconfident, driven by draw compression and neutral-venue effects. For knockout matches, which are played on neutral ground and have extra-time and penalty provisions that compress the regulation-time draw probability differently, this will need careful attention.

The draw problem is partially self-correcting for knockouts. Knockout matches that finish level go to extra time and penalties. The model's knockout-stage probabilities condition on eventual advancement rather than regulation-time outcomes, so the draw compression issue is less directly relevant. However, the underlying goal-rate calibration (away-team lambda deflated by a home advantage that does not exist) still applies and may affect the shape of knockout predictions.
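One way to see why draw compression matters less for advancement probabilities is the sketch below. It is not the model's knockout logic: it assumes advancement equals the regulation-time win probability plus a share of the draw probability, and the `tiebreak_share` parameter and example numbers are hypothetical.

```python
def advance_probability(p_win, p_draw, p_loss, tiebreak_share=0.5):
    """P(advance) = P(win in regulation) + P(level) * P(win extra time or penalties)."""
    if abs(p_win + p_draw + p_loss - 1.0) > 1e-6:
        raise ValueError("regulation-time probabilities must sum to 1")
    return p_win + p_draw * tiebreak_share

favourite = advance_probability(0.60, 0.25, 0.15)
underdog = advance_probability(0.15, 0.25, 0.60)
print(round(favourite, 3), round(underdog, 3))   # 0.725 0.275

# Moving 5 points from the favourite's win to the draw changes advancement by 2.5
print(round(advance_probability(0.55, 0.30, 0.15), 3))   # 0.7
```

Under this assumption, an error in the win/draw split is roughly halved by the time it reaches the advancement probability, while an error in the goal rates themselves passes straight through.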

Tournament Elo has diverged from pre-tournament Elo. The USA has gained nearly 50 Elo points. Turkey has lost 50. Belgium, rated as Group G's clear top seed at 95.7% to advance, have drawn both matches and sit on 2 points. These shifts should be reflected in knockout predictions if the model is to learn from the tournament it is forecasting.

The video pipeline is producing data, but the signal is unproven. Two forward tests have been run. The first (matchday 1 to matchday 2) failed the placebo gate at the 43rd percentile. The second (matchday 2 to matchday 3) produced the best-performing variant at -2.7 basis points on six scored matches, but that margin is too small and the sample too thin to clear a placebo gate. If the cumulative forward test across the knockout stage also fails, the honest conclusion is that highlight-reel video analysis does not carry enough information density to improve match-level probability estimates, at least not at the shift scales we tested.

The model entered the tournament with an honest pre-tournament Brier estimate of 0.572. Fifty-four matches in, it has done better than that (0.541) while revealing a specific and measurable weakness. A model that beats its own baseline while openly documenting its failure modes is, in our view, earning the reader's trust the right way: through receipts, not promises.

The full match-by-match scores are on the report card. The calibration tracker updates daily at /accuracy/. The frozen predictions that these scores are graded against are in the public JSON record. The knockout stage begins soon, and so does the next round of receipts.

All metrics in this post are computed from frozen pre-match model outputs scored against observed results. The grading methodology is documented in The number we grade ourselves on. The model publishes probabilities, not recommendations.