Checking our work

Do the probabilities hold up?

When the model says a team has a 70% chance, that team should win about seven times out of ten. This page checks whether that happens. Every match of the 2026 World Cup is scored at the moment it is played, and the numbers below ask one question: are the probabilities honest, not just confident? (The technical name for this is calibration.)

Track record

Proven on past tournaments

The short version: when the model says 70%, it happens about 70% of the time. Across these 987 matches its stated chances landed within about 5.6 percentage points of reality, and on average it rated the actual result about 35% more likely than a blind 1-in-3 guess would.

The 2026 tracker below stays empty until the first match kicks off. So to show the model has been tested, not just described, we ran it against tournaments whose results you already know. For each one, the model was rebuilt exactly as it stood the day before kickoff, then graded on every match — it never sees the result it is being marked on. That is what “graded out of sample” means: no peeking, no hindsight.

Each tournament is scored by the production model reconstructed as it stood the day before the tournament's first match: Dixon-Coles and Hierarchical Poisson refit on matches strictly beforehand, Elo rolled forward to each match, and the tournament-tier calibration layer refit on the 24 months of matches before the cutoff. No data from the tournament, or any later match, touches any layer of the fit.
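As a rough illustration of the cutoff rule described above, the sketch below selects the data a single backtest would be allowed to see. The match schema (a dict with a `date` key), the function names, and the day-clamping shortcut in `months_before` are assumptions made for this example, not the production pipeline. It covers only the date filtering, not the model fits themselves.

```python
from datetime import date


def months_before(day: date, months: int) -> date:
    """Return the date `months` calendar months before `day`.

    Simplification (assumption): the day of month is clamped to 28 so the
    result is always a valid date.
    """
    total = day.year * 12 + (day.month - 1) - months
    year, month_index = divmod(total, 12)
    return date(year, month_index + 1, min(day.day, 28))


def split_for_tournament(matches, first_match_day, calib_months=24):
    """Pick the data one backtest may use (hypothetical helper).

    fit_set:   every match strictly before the tournament's first match.
    calib_set: the subset of fit_set inside the trailing calibration window.
    """
    fit_set = [m for m in matches if m["date"] < first_match_day]
    window_start = months_before(first_match_day, calib_months)
    calib_set = [m for m in fit_set if m["date"] >= window_start]
    return fit_set, calib_set


# Usage: a tournament whose first match is on 2022-11-20.
history = [
    {"date": date(2020, 1, 1)},
    {"date": date(2021, 7, 1)},
    {"date": date(2022, 11, 19)},
    {"date": date(2022, 11, 20)},  # tournament match: excluded
    {"date": date(2022, 12, 18)},  # later match: excluded
]
fit_set, calib_set = split_for_tournament(history, date(2022, 11, 20))
```

Filtering with a strict `<` on the cutoff date is what keeps the tournament's own matches, and anything later, out of every fitted layer.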

Graded across 987 matches at 24 major tournaments (2014–2024)

Brier score: 0.572
Log loss: 1.000
ECE: 5.6pp

Tournament by tournament

One row per tournament: the model rebuilt as it stood the day before it began, then graded on every match through the final (Brier — lower is better, blind 1-in-3 guessing is 0.667). A few thin, early editions sit above that line; the honest measure is all of them pooled, in the box above.

Tournament	Host	Matches	Brier
Copa América 2024	United States	32	0.522
Euro 2024	Germany	51	0.613
AFCON 2024	Ivory Coast	52	0.651
Asian Cup 2024	Qatar	51	0.515
Gold Cup 2023	United States	31	0.566
World Cup 2022	Qatar	64	0.611
AFCON 2022	Cameroon	52	0.686
Gold Cup 2021	United States	31	0.341
Copa América 2021	Brazil	28	0.481
Euro 2021	England	51	0.554
AFCON 2019	Egypt	52	0.546
Gold Cup 2019	United States	31	0.405
Copa América 2019	Brazil	26	0.542
Asian Cup 2019	United Arab Emirates	51	0.496
World Cup 2018	Russia	64	0.569
Gold Cup 2017	United States	25	0.456
AFCON 2017	Gabon	32	0.642
Euro 2016	France	51	0.668
Copa América 2016	United States	32	0.502
Gold Cup 2015	United States	26	0.755
Copa América 2015	Chile	26	0.686
AFCON 2015	Equatorial Guinea	32	0.795
Asian Cup 2015	Australia	32	0.434
World Cup 2014	Brazil	64	0.565

Reliability diagram

Read it like this: each dot is a group of similar forecasts. Left-to-right is what the model predicted, bottom-to-top is how often it actually happened. When the model says 70% and that happens about 70% of the time, the dot sits on the dashed line: perfect calibration. The closer the dots hug the line, the more honest the probabilities; bigger dots mean more matches in that group.
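The dots in a diagram like this can be produced by binning forecasts, roughly as sketched below. The function name, the ten equal-width bins, and the flat list of (probability, happened) pairs are assumptions for illustration; the page does not say how its groups are formed.

```python
def reliability_points(probs, outcomes, n_bins=10):
    """Group forecasts into equal-width probability bins.

    probs:    predicted probabilities in [0, 1]
    outcomes: 1 if the predicted event happened, else 0
    Returns one (mean_predicted, observed_frequency, count) tuple per
    non-empty bin: the x position, y position, and size of each dot.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))

    points = []
    for members in bins:
        if members:
            n = len(members)
            mean_p = sum(p for p, _ in members) / n
            freq = sum(y for _, y in members) / n
            points.append((mean_p, freq, n))
    return points


# Usage: ten forecasts of 75% (seven came true) and four of 25% (one came true).
probs = [0.75] * 10 + [0.25] * 4
outcomes = [1] * 7 + [0] * 3 + [1, 0, 0, 0]
points = reliability_points(probs, outcomes)
```

A dot on the dashed line has `mean_predicted` equal to `observed_frequency`; in this toy data the 25% group sits on the line and the 75% group sits slightly below it.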

Brier by favourite confidence

Matches grouped by how confident the model's favourite was (its biggest of the home / draw / away probabilities) — so you can see whether it is as reliable on toss-ups as on heavy favourites.

Favourite confidence	Matches	Brier
P_fav < 40%	81	0.649
P_fav 40-60%	476	0.633
P_fav 60-80%	318	0.512
P_fav >= 80%	112	0.428
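A breakdown like the table above could be built along these lines. The record layout, the helper names, and the handling of values that fall exactly on a bucket edge are assumptions; the page only gives the bucket labels.

```python
# Bucket edges follow the table above. Which side an exact boundary value
# (40%, 60%, 80%) falls on is an assumption of this sketch.
BUCKETS = [
    (0.0, 0.4, "P_fav < 40%"),
    (0.4, 0.6, "P_fav 40-60%"),
    (0.6, 0.8, "P_fav 60-80%"),
    (0.8, float("inf"), "P_fav >= 80%"),
]


def bucket_label(p_fav):
    """Return the label of the bucket that contains p_fav."""
    for low, high, label in BUCKETS:
        if low <= p_fav < high:
            return label
    raise ValueError(f"probability out of range: {p_fav}")


def brier_by_favourite(records):
    """records: iterable of ((p_home, p_draw, p_away), match_brier).

    Returns {label: (match_count, mean_brier)} for every non-empty bucket.
    """
    groups = {}
    for probs, score in records:
        p_fav = max(probs)  # the favourite is the largest of the three
        groups.setdefault(bucket_label(p_fav), []).append(score)
    return {label: (len(s), sum(s) / len(s)) for label, s in groups.items()}


# Usage with three made-up matches.
records = [
    ((0.50, 0.30, 0.20), 0.38),
    ((0.45, 0.30, 0.25), 0.98),
    ((0.85, 0.10, 0.05), 0.035),
]
table = brier_by_favourite(records)
```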

Out-of-sample: the calibration layer is refit per tournament on pre-tournament data, so these numbers do not reuse the live shipped calibrator (which has seen these results).
The uniform 1/3 forecast scores a Brier of 0.667; lower is better. Major-tournament football is high-variance, so a strong model still sits well above a league-season Brier.
Calibrated and uncalibrated metrics are reported on the same fixtures so the calibration layer's effect is visible.

Built 2026-05-30 · model 1.0.0 · calibration layer refit on the 24 months before each tournament.

World Cup 2026: live tracking

Across 96 graded forecasts so far, the model's overall Brier score is 0.497 (lower is better; blind 1-in-3 guessing scores ≈ 0.667). The reliability diagram below plots what the model predicted against how often it actually came true. The closer to the diagonal, the more honest the probabilities.

Overall, across 96 scored matches

Brier score: 0.497
Log loss: 0.837
ECE: 0.167

Rolling Brier, 10-match window

The model's accuracy over its most recent 10 matches, re-figured after each game. Lower is better; the dashed line is what blind 1-in-3 guessing would score, so anything below it is real skill.
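A trailing-window average of this kind can be sketched as follows. The function name is hypothetical, and averaging whatever is available before ten matches have been played is an assumption; the page does not say how the first few points are handled.

```python
from collections import deque


def rolling_mean(scores, window=10):
    """Mean of the most recent `window` scores, recomputed after each match."""
    recent = deque(maxlen=window)  # the oldest score drops out automatically
    averages = []
    for score in scores:
        recent.append(score)
        averages.append(sum(recent) / len(recent))
    return averages


# Usage with a window of 2 so the effect is easy to follow by hand.
example = rolling_mean([1.0, 2.0, 3.0, 4.0], window=2)
# example == [1.0, 1.5, 2.5, 3.5]
```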

Per-matchday breakdown

Date	Matches	Brier	Log loss
2026-06-11	2	0.404	0.723
2026-06-12	2	0.817	1.280
2026-06-13	4	0.771	1.178
2026-06-14	4	0.595	1.008
2026-06-15	4	1.148	1.632
2026-06-16	4	0.197	0.438
2026-06-17	4	0.640	1.005
2026-06-18	4	0.426	0.739
2026-06-19	4	0.463	0.787
2026-06-20	4	0.531	0.854
2026-06-21	4	0.603	0.928
2026-06-22	4	0.266	0.529
2026-06-23	4	0.433	0.706
2026-06-24	6	0.410	0.750
2026-06-25	6	0.506	0.868
2026-06-26	6	0.388	0.704
2026-06-27	6	0.434	0.763
2026-06-29	2	0.612	0.996
2026-06-30	3	0.500	0.866
2026-07-01	3	0.478	0.830
2026-07-02	3	0.327	0.622
2026-07-03	3	0.460	0.796
2026-07-04	1	0.225	0.470
2026-07-06	2	0.443	0.784
2026-07-07	2	0.541	0.909
2026-07-09	1	0.573	0.960
2026-07-10	1	0.280	0.564
2026-07-11	2	0.250	0.517
2026-07-15	1	0.537	0.911

Updated 2026-07-16.

Reliability diagram

Brier by competition

Segment	Matches	Brier	Δ vs overall
World Cup 2026	96	0.497	+0.000

Brier by tournament stage

Segment	Matches	Brier	Δ vs overall
Group stage	72	0.516	+0.019
Round of 32	15	0.449	-0.048
Round of 16	4	0.492	-0.005
Quarter-final	4	0.338	-0.159
Semi-final	1	0.537	+0.039
Third-place play-off	—	—	—
Final	—	—	—
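The stage rows above can be tied back to the overall figure with a short check. It assumes that the overall Brier is the match-weighted mean of the segment scores and that "Δ vs overall" is the segment score minus the overall score; both readings fit the published numbers but are not stated on the page.

```python
# (segment, matches, Brier) copied from the stage table above.
stages = [
    ("Group stage", 72, 0.516),
    ("Round of 32", 15, 0.449),
    ("Round of 16", 4, 0.492),
    ("Quarter-final", 4, 0.338),
    ("Semi-final", 1, 0.537),
]


def pooled_brier(rows):
    """Match-weighted mean of per-segment Brier scores."""
    total_matches = sum(n for _, n, _ in rows)
    return sum(n * score for _, n, score in rows) / total_matches


overall = pooled_brier(stages)
deltas = {name: score - overall for name, _, score in stages}

print(round(overall, 3))  # 0.497
```

Because the inputs are already rounded to three decimals, a recomputed delta can differ from the published one in the last digit.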

Brier by favourite confidence

Segment	Matches	Brier	Δ vs overall
P_fav < 40%	11	0.669	+0.172
P_fav 40-60%	47	0.540	+0.043
P_fav 60-80%	32	0.373	-0.125
P_fav >= 80%	6	0.513	+0.016

What the three numbers mean

Think of a weather forecaster. Anyone can say "70% chance of rain." The good ones are right about 70% of the time when they say it. These three numbers check the model the same way.

Brier score: did the probabilities land close to reality? For each match, we measure how far the forecast was from what actually happened, then take the average. A perfect crystal ball scores 0; guessing 1 in 3 every time scores about 0.667. Lower is better.
Log loss: the same idea, but overconfidence is punished hard. Call something nearly certain and get it wrong, and this number shoots up. It is the metric that keeps the model humble. Blind guessing scores about 1.099. Lower is better.
ECE: does "70%" really mean 70%? We gather all the "about 70%" forecasts and check how often those things actually happened. The average gap, across all confidence levels, is the ECE. A few percentage points means the stated probabilities can be taken at face value. Lower is better.

The first two reward being accurate and bold; the last is the honesty check. A model can look impressive and still overstate its confidence; measuring all three together is what catches that.

Want the mechanics, including the component models and the out-of-sample test behind these numbers? It is all on the methodology page.