Checking our work
Do the probabilities hold up?
When the model says a team has a 70% chance, that team should win about seven times in ten. This page checks whether that happens. Every 2026 World Cup match is graded as it is played, and the numbers below ask one question: are the probabilities honest, not just confident? (The technical name for this is calibration.)
Track record
Proven on past tournaments
The short version: when the model says 70%, it happens about 70% of the time. Across these 987 matches its stated chances landed within about 5.6 percentage points of reality, and on average it rated the actual result about 35% more likely than a blind 1-in-3 guess would.
The 2026 tracker below stays empty until the first match kicks off. So to show the model has been tested, not just described, we ran it against tournaments whose results you already know. For each one, the model was rebuilt exactly as it stood the day before kickoff, then graded on every match — it never sees the result it is being marked on. That is what “graded out of sample” means: no peeking, no hindsight.
Each tournament is scored by the production model reconstructed as it stood the day before the tournament's first match: Dixon-Coles and Hierarchical Poisson refit on matches strictly beforehand, Elo rolled forward to each match, and the tournament-tier calibration layer refit on the 24 months of matches before the cutoff. No data from the tournament, or any later match, touches any layer of the fit.
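The date rules in that paragraph can be sketched as a pair of filters. This is a minimal illustration, not the production pipeline: `months_before` and `training_windows` are hypothetical names, matches are reduced to their dates, and the 24-month window is assumed to be counted in calendar months back from the first match.

```python
from datetime import date

def months_before(d: date, months: int) -> date:
    """Date `months` calendar months before `d`.

    Simplification: the day is clamped to 28 so the result is always valid.
    """
    total = d.year * 12 + (d.month - 1) - months
    return date(total // 12, total % 12 + 1, min(d.day, 28))

def training_windows(match_dates: list[date], first_match: date, calib_months: int = 24):
    """Split match dates into what each layer may see for one tournament.

    Returns (base_fit, calib_fit):
      base_fit  - every match strictly before the tournament's first match
      calib_fit - the subset inside the last `calib_months` months before it
    """
    base_fit = [d for d in match_dates if d < first_match]
    calib_start = months_before(first_match, calib_months)
    calib_fit = [d for d in base_fit if d >= calib_start]
    return base_fit, calib_fit
```

The strict `<` comparison is what keeps the tournament's own matches, and anything later, out of both sets.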
Graded across 987 matches at 24 major tournaments (2014–2024)
- Brier score: 0.572. How close the forecasts landed to reality. Lower is better; blind 1-in-3 guessing scores ≈ 0.667.
- Log loss: 1.000. Like Brier, but overconfidence is punished harder. Lower is better; blind guessing scores ≈ 1.099.
- ECE: 5.6pp. Does “70%” really mean 70%? The average gap between the two. Lower is better.
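For reference, the two blind-guessing baselines quoted here can be recovered by hand. The working below assumes the Brier score sums the squared error over all three outcomes (home, draw, away) and that log loss uses the natural logarithm; both assumptions are consistent with the 0.667 and 1.099 figures, but the page does not state them explicitly.

```latex
\[
\text{Brier}_{\text{blind}}
  = \left(1 - \tfrac{1}{3}\right)^2 + 2\left(0 - \tfrac{1}{3}\right)^2
  = \tfrac{4}{9} + \tfrac{2}{9}
  = \tfrac{2}{3} \approx 0.667
\]
\[
\text{LogLoss}_{\text{blind}}
  = -\ln\tfrac{1}{3}
  = \ln 3 \approx 1.099
\]
```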
Tournament by tournament
One row per tournament: the model rebuilt as it stood the day before it began, then graded on every match through the final (Brier: lower is better, blind 1-in-3 guessing is 0.667). Single tournaments are small samples of 25 to 64 matches, and five editions, four of them from 2015 and 2016, sit above that line; the honest measure is all of them pooled, in the box above.
| Tournament | Host | Matches | Brier |
|---|---|---|---|
| Copa América 2024 | United States | 32 | 0.522 |
| Euro 2024 | Germany | 51 | 0.613 |
| AFCON 2024 | Ivory Coast | 52 | 0.651 |
| Asian Cup 2024 | Qatar | 51 | 0.515 |
| Gold Cup 2023 | United States | 31 | 0.566 |
| World Cup 2022 | Qatar | 64 | 0.611 |
| AFCON 2022 | Cameroon | 52 | 0.686 |
| Gold Cup 2021 | United States | 31 | 0.341 |
| Copa América 2021 | Brazil | 28 | 0.481 |
| Euro 2021 | Pan-European (final in England) | 51 | 0.554 |
| AFCON 2019 | Egypt | 52 | 0.546 |
| Gold Cup 2019 | United States | 31 | 0.405 |
| Copa América 2019 | Brazil | 26 | 0.542 |
| Asian Cup 2019 | United Arab Emirates | 51 | 0.496 |
| World Cup 2018 | Russia | 64 | 0.569 |
| Gold Cup 2017 | United States | 25 | 0.456 |
| AFCON 2017 | Gabon | 32 | 0.642 |
| Euro 2016 | France | 51 | 0.668 |
| Copa América 2016 | United States | 32 | 0.502 |
| Gold Cup 2015 | United States | 26 | 0.755 |
| Copa América 2015 | Chile | 26 | 0.686 |
| AFCON 2015 | Equatorial Guinea | 32 | 0.795 |
| Asian Cup 2015 | Australia | 32 | 0.434 |
| World Cup 2014 | Brazil | 64 | 0.565 |
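As a quick consistency check, the pooled figure can be rebuilt from the rows above. Because the Brier score is a per-match average, weighting each tournament's score by its match count gives the pooled score. The sketch below uses the rounded values from the table, so agreement is only to about three decimals; `ROWS` and `pooled_brier` are illustrative names.

```python
# (matches, Brier) pairs copied from the table above, top to bottom.
ROWS = [
    (32, 0.522), (51, 0.613), (52, 0.651), (51, 0.515), (31, 0.566), (64, 0.611),
    (52, 0.686), (31, 0.341), (28, 0.481), (51, 0.554), (52, 0.546), (31, 0.405),
    (26, 0.542), (51, 0.496), (64, 0.569), (25, 0.456), (32, 0.642), (51, 0.668),
    (32, 0.502), (26, 0.755), (26, 0.686), (32, 0.795), (32, 0.434), (64, 0.565),
]

def pooled_brier(rows):
    """Match-weighted mean of per-tournament Brier scores."""
    total_matches = sum(n for n, _ in rows)
    weighted_sum = sum(n * b for n, b in rows)
    return total_matches, weighted_sum / total_matches

total, pooled = pooled_brier(ROWS)
print(total, round(pooled, 3))  # 987 0.572
```

The result matches the 987 matches and 0.572 Brier reported in the summary box.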
Reliability diagram
Read it like this: each dot is a group of similar forecasts — left-to-right is what the model predicted, bottom-to-top is how often it actually happened. When the model says 70% and that happens about 70% of the time, the dot sits on the dashed line: perfect calibration. The closer the dots hug the line, the more honest the probabilities; bigger dots mean more matches in that group.
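The dots in a chart like this can be computed with a few lines of binning. The sketch below is one plausible construction, not necessarily the one behind this page: it assumes ten equal-width bins and takes a flat list of (predicted probability, did it happen) pairs; `reliability_points` is a hypothetical name.

```python
def reliability_points(forecasts, n_bins=10):
    """Group forecasts into equal-width probability bins.

    forecasts: iterable of (predicted_probability, happened) pairs.
    Returns one (mean_predicted, observed_frequency, count) tuple per
    non-empty bin: the x position, y position and size of each dot.
    """
    bins = [[] for _ in range(n_bins)]
    for p, happened in forecasts:
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the top bin
        bins[idx].append((p, 1.0 if happened else 0.0))

    points = []
    for members in bins:
        if not members:
            continue
        count = len(members)
        mean_predicted = sum(p for p, _ in members) / count
        observed = sum(y for _, y in members) / count
        points.append((mean_predicted, observed, count))
    return points
```

A dot sits on the dashed line when `mean_predicted` equals `observed` for that bin.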
Brier by favourite confidence
Matches grouped by how confident the model's favourite was (its biggest of the home / draw / away probabilities) — so you can see whether it is as reliable on toss-ups as on heavy favourites.
| Favourite confidence | Matches | Brier |
|---|---|---|
| P_fav < 40% | 81 | 0.649 |
| P_fav 40-60% | 476 | 0.633 |
| P_fav 60-80% | 318 | 0.512 |
| P_fav >= 80% | 112 | 0.428 |
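Assigning a match to one of these rows is a matter of taking the largest of its three probabilities. In the sketch below the `< 40%` and `>= 80%` edges come from the table; treating the 40% and 60% edges as belonging to the higher bucket is an assumption, and `favourite_bucket` is a hypothetical name.

```python
def favourite_bucket(p_home: float, p_draw: float, p_away: float) -> str:
    """Return the favourite-confidence row a match falls into."""
    p_fav = max(p_home, p_draw, p_away)  # the model's favourite outcome
    if p_fav < 0.40:
        return "P_fav < 40%"
    if p_fav < 0.60:   # assumed: exactly 40% lands here
        return "P_fav 40-60%"
    if p_fav < 0.80:   # assumed: exactly 60% lands here
        return "P_fav 60-80%"
    return "P_fav >= 80%"
```

Since the three probabilities sum to 1, the favourite can never fall below one third, so the lowest bucket only covers forecasts between roughly 33% and 40%.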
- Out-of-sample: the calibration layer is refit per tournament on pre-tournament data, so these numbers do not reuse the live shipped calibrator (which has seen these results).
- The uniform 1/3 forecast scores a Brier of 0.667; lower is better. Major-tournament football is high-variance, so a strong model still sits well above a league-season Brier.
- Calibrated and uncalibrated metrics are reported on the same fixtures so the calibration layer's effect is visible.
Built 2026-05-30 · model 1.0.0 · calibration layer refit on the 24 months before each tournament.
World Cup 2026: live tracking
Live grading begins once matches are played. Until then, the scoreboard above already grades the model on past tournaments, and the full method is on the methodology page.
No scored matches yet
First scored match expected 2026-06-11. Once the first match is played, this page grades the model in real time: a running accuracy chart and day-by-day breakdown, updated after every game.
Nothing to show yet: the forecasts already exist on each match page, but no games have been played to test them against.
Reliability diagram
Not enough matches to draw this yet: none have been graded, and the chart appears once at least 50 are in. Until then, the scoreboard above already shows this same chart for past tournaments, and the full backtest is on the methodology page.
Brier by competition
| Segment | Matches | Brier | Δ vs overall |
|---|---|---|---|
| World Cup 2026 | — | — | — |
Brier by tournament stage
| Segment | Matches | Brier | Δ vs overall |
|---|---|---|---|
| Group stage | — | — | — |
| Round of 32 | — | — | — |
| Round of 16 | — | — | — |
| Quarter-final | — | — | — |
| Semi-final | — | — | — |
| Third-place play-off | — | — | — |
| Final | — | — | — |
Brier by favourite confidence
| Segment | Matches | Brier | Δ vs overall |
|---|---|---|---|
| P_fav < 40% | — | — | — |
| P_fav 40-60% | — | — | — |
| P_fav 60-80% | — | — | — |
| P_fav >= 80% | — | — | — |
What the three numbers mean
Think of a weather forecaster. Anyone can say "70% chance of rain." With the good ones, it actually rains about 70% of the time when they say that. These three numbers check the model the same way.
- Brier score: did the probabilities land close to reality? For each match, we measure how far the forecast was from what actually happened and take the average. A perfect crystal ball scores 0; guessing 1 in 3 every time scores about 0.667. Lower is better.
- Log loss: the same idea, but overconfidence is punished hard. Call something near-certain and get it wrong, and this number shoots up. It is the metric that keeps the model humble. Blind guessing scores about 1.099. Lower is better.
- ECE: does "70%" really mean 70%? We gather all the "about 70%" forecasts and check how often those things actually happened. The average gap, across all confidence levels, is the ECE. A few percentage points means the stated probabilities can be taken at face value. Lower is better.
The first two reward being accurate and bold; the last is the honesty check. A model can look impressive and still overstate its confidence; measuring all three together is what catches that.
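For readers who want to see the arithmetic, here is a minimal sketch of all three metrics for home / draw / away forecasts. It rests on assumptions the page does not spell out: the Brier score sums squared error over the three outcomes, log loss uses the natural logarithm, and ECE uses ten equal-width bins over every (match, outcome) probability. The function names are illustrative, not the model's actual code.

```python
import math

def brier_score(forecasts, outcomes):
    """Mean over matches of the squared error summed across the three outcomes."""
    total = 0.0
    for probs, outcome in zip(forecasts, outcomes):
        total += sum((p - (1.0 if k == outcome else 0.0)) ** 2
                     for k, p in enumerate(probs))
    return total / len(outcomes)

def log_loss(forecasts, outcomes, eps=1e-15):
    """Mean negative natural log of the probability given to the actual result."""
    total = sum(math.log(max(probs[o], eps)) for probs, o in zip(forecasts, outcomes))
    return -total / len(outcomes)

def ece(forecasts, outcomes, n_bins=10):
    """Count-weighted average gap between stated and observed frequency per bin."""
    bins = [[0, 0.0, 0.0] for _ in range(n_bins)]  # count, sum predicted, sum happened
    for probs, outcome in zip(forecasts, outcomes):
        for k, p in enumerate(probs):
            b = bins[min(int(p * n_bins), n_bins - 1)]
            b[0] += 1
            b[1] += p
            b[2] += 1.0 if k == outcome else 0.0
    n = sum(b[0] for b in bins)
    return sum(abs(b[1] - b[2]) for b in bins) / n

# Blind guessing on one home win, one draw and one away win (indices 0, 1, 2).
blind = [(1 / 3, 1 / 3, 1 / 3)] * 3
results = [0, 1, 2]
print(round(brier_score(blind, results), 3))  # 0.667
print(round(log_loss(blind, results), 3))     # 1.099
```

On this toy example the blind guess has an ECE of essentially zero while scoring poorly on Brier and log loss, which shows the reverse case too: a forecast can be well calibrated and still uninformative, so no single number is enough.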
Want the mechanics behind it: the component models and the out-of-sample test behind these numbers? It's all on the methodology page.