Checking our work

Do the probabilities come true?

When the model gives a team a 70% chance, that team should win about seven times out of ten. This page checks that. Every 2026 World Cup match is scored as soon as it is played, and the figures below ask a single question: are the probabilities honest, not just confident? (The technical name for this is calibration.)

Track record

Proven on past tournaments

The short version: when the model says 70%, it happens about 70%. Across these 987 matches its stated chances landed within ~5.6 points of reality, and on average it rated the actual result about 35% more likely than a blind 1-in-3 guess would.

The 2026 tracker below stays empty until the first match kicks off. So to show the model has been tested, not just described, we ran it against tournaments whose results you already know. For each one, the model was rebuilt exactly as it stood the day before kickoff, then graded on every match — it never sees the result it is being marked on. That is what “graded out of sample” means: no peeking, no hindsight.

Each tournament is scored by the production model reconstructed as it stood the day before the tournament's first match: Dixon-Coles and Hierarchical Poisson refit on matches strictly beforehand, Elo rolled forward to each match, and the tournament-tier calibration layer refit on the 24 months of matches before the cutoff. No data from the tournament, or any later match, touches any layer of the fit.
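The cutoff rule above can be sketched as a date filter. This is a minimal illustration, not the production pipeline: the `Match` record layout, the function names, and the day-clamping shortcut in `months_before` are all assumptions made for the example.

```python
from datetime import date

# Hypothetical record layout: (match_date, home, away, result).
Match = tuple[date, str, str, str]


def months_before(d: date, months: int) -> date:
    """Return the date `months` calendar months before `d`.

    Simplification for this sketch: the day is clamped to 28 so the
    result is always a valid date.
    """
    total = d.year * 12 + (d.month - 1) - months
    return date(total // 12, total % 12 + 1, min(d.day, 28))


def backtest_split(matches, first_match: date, last_match: date,
                   calib_months: int = 24):
    """Split match history for one tournament without leaking future results."""
    # Base models: every match strictly before the tournament opens.
    train = [m for m in matches if m[0] < first_match]
    # Calibration layer: only the final `calib_months` months before the cutoff.
    window_start = months_before(first_match, calib_months)
    calib = [m for m in train if m[0] >= window_start]
    # Graded set: matches inside the tournament window. A real pipeline would
    # also filter by competition; that is omitted here.
    graded = [m for m in matches if first_match <= m[0] <= last_match]
    return train, calib, graded
```

Because `calib` is drawn from `train`, nothing dated on or after the first match can reach either fit, which is the property the paragraph above describes.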

Graded across 987 matches at 24 major tournaments (2014–2024)

Brier score: 0.572

How close the forecasts landed to reality. Lower is better; blind 1-in-3 guessing scores ≈ 0.667.

Log loss: 1.000

Like Brier, but overconfidence is punished harder. Lower is better; blind guessing scores ≈ 1.099.

ECE (expected calibration error): 5.6 pp

Does “70%” really mean 70%? The average gap between the two. Lower is better.
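The two blind-guess baselines quoted above follow from a uniform forecast over three outcomes. The derivation assumes the Brier score here is the multi-class form that sums squared error over home, draw and away, which is what the 0.667 figure implies:

```latex
\begin{align*}
\text{Brier}_{\text{uniform}} &= \left(1 - \tfrac{1}{3}\right)^2 + 2\left(0 - \tfrac{1}{3}\right)^2
  = \tfrac{4}{9} + \tfrac{2}{9} = \tfrac{2}{3} \approx 0.667 \\
\text{LogLoss}_{\text{uniform}} &= -\ln\tfrac{1}{3} = \ln 3 \approx 1.099
\end{align*}
```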

Tournament by tournament

One row per tournament: the model rebuilt as it stood the day before it began, then graded on every match through the final (Brier — lower is better, blind 1-in-3 guessing is 0.667). A few thin, early editions sit above that line; the honest measure is all of them pooled, in the box above.

| Tournament | Host | Matches | Brier |
| --- | --- | --- | --- |
| Copa América 2024 | United States | 32 | 0.522 |
| Euro 2024 | Germany | 51 | 0.613 |
| AFCON 2024 | Ivory Coast | 52 | 0.651 |
| Asian Cup 2024 | Qatar | 51 | 0.515 |
| Gold Cup 2023 | United States | 31 | 0.566 |
| World Cup 2022 | Qatar | 64 | 0.611 |
| AFCON 2022 | Cameroon | 52 | 0.686 |
| Gold Cup 2021 | United States | 31 | 0.341 |
| Copa América 2021 | Brazil | 28 | 0.481 |
| Euro 2021 | England | 51 | 0.554 |
| AFCON 2019 | Egypt | 52 | 0.546 |
| Gold Cup 2019 | United States | 31 | 0.405 |
| Copa América 2019 | Brazil | 26 | 0.542 |
| Asian Cup 2019 | United Arab Emirates | 51 | 0.496 |
| World Cup 2018 | Russia | 64 | 0.569 |
| Gold Cup 2017 | United States | 25 | 0.456 |
| AFCON 2017 | Gabon | 32 | 0.642 |
| Euro 2016 | France | 51 | 0.668 |
| Copa América 2016 | United States | 32 | 0.502 |
| Gold Cup 2015 | United States | 26 | 0.755 |
| Copa América 2015 | Chile | 26 | 0.686 |
| AFCON 2015 | Equatorial Guinea | 32 | 0.795 |
| Asian Cup 2015 | Australia | 32 | 0.434 |
| World Cup 2014 | Brazil | 64 | 0.565 |

Reliability diagram

Read it like this: each dot is a group of similar forecasts — left-to-right is what the model predicted, bottom-to-top is how often it actually happened. When the model says 70% and that happens about 70% of the time, the dot sits on the dashed line: perfect calibration. The closer the dots hug the line, the more honest the probabilities; bigger dots mean more matches in that group.

[Figure: reliability diagram. Predicted probability on the x-axis, observed frequency on the y-axis, both running from 0.00 to 1.00, binned in deciles across [0, 1]. Closer to the identity line means better calibrated. Bin sizes as listed: n = 327, 478, 911, 329, 269, 217, 184, 134, 73, 39.]
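The points on a diagram like this can be produced by a simple binning pass. The sketch below is illustrative only: the function name and input layout are assumptions. It treats every home, draw and away probability as its own data point, which is an inference from the bin sizes above (they sum to 2,961, three times 987), not something the page states.

```python
def reliability_bins(probs, outcomes, n_bins=10):
    """Group forecasts into equal-width bins and compare them with reality.

    probs    -- flat list of predicted probabilities, one per (match, outcome)
    outcomes -- matching list of 1 (that outcome happened) or 0 (it did not)

    Returns one row per non-empty bin:
    (bin_low, bin_high, mean_predicted, observed_frequency, count).
    """
    sums = [[0.0, 0.0, 0] for _ in range(n_bins)]  # [sum of p, hits, count]
    for p, y in zip(probs, outcomes):
        i = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the top bin
        sums[i][0] += p
        sums[i][1] += y
        sums[i][2] += 1
    rows = []
    for i, (sum_p, hits, n) in enumerate(sums):
        if n:  # empty bins have no dot to draw
            rows.append((i / n_bins, (i + 1) / n_bins, sum_p / n, hits / n, n))
    return rows
```

Each row maps to one dot: `mean_predicted` is the x position, `observed_frequency` the y position, and `count` the dot size.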

Brier by favourite confidence

Matches grouped by how confident the model's favourite was (its biggest of the home / draw / away probabilities) — so you can see whether it is as reliable on toss-ups as on heavy favourites.

| Favourite confidence | Matches | Brier |
| --- | --- | --- |
| P_fav < 40% | 81 | 0.649 |
| P_fav 40-60% | 476 | 0.633 |
| P_fav 60-80% | 318 | 0.512 |
| P_fav >= 80% | 112 | 0.428 |

  • Out-of-sample: the calibration layer is refit per tournament on pre-tournament data, so these numbers do not reuse the live shipped calibrator (which has seen these results).
  • The uniform 1/3 forecast scores a Brier of 0.667; lower is better. Major-tournament football is high-variance, so a strong model still sits well above a league-season Brier.
  • Calibrated and uncalibrated metrics are reported on the same fixtures so the calibration layer's effect is visible.
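The favourite-confidence grouping above can be sketched as follows. The function names and input layout are hypothetical, and the handling of exact boundaries (for example, whether a favourite at exactly 60% counts as 40-60% or 60-80%) is an assumption; the page does not say.

```python
def favourite_bucket(p_home, p_draw, p_away):
    """Label a match by the probability of the model's favourite outcome."""
    p_fav = max(p_home, p_draw, p_away)
    if p_fav < 0.40:
        return "P_fav < 40%"
    if p_fav < 0.60:  # assumption: lower edge inclusive, upper edge exclusive
        return "P_fav 40-60%"
    if p_fav < 0.80:
        return "P_fav 60-80%"
    return "P_fav >= 80%"


def brier(probs, actual_index):
    """Multi-class Brier score for one match (sum of squared errors)."""
    return sum((p - (1.0 if i == actual_index else 0.0)) ** 2
               for i, p in enumerate(probs))


def brier_by_bucket(forecasts):
    """forecasts: list of ((p_home, p_draw, p_away), actual_index).

    Returns {bucket label: (match count, mean Brier)}.
    """
    groups = {}
    for probs, actual in forecasts:
        groups.setdefault(favourite_bucket(*probs), []).append(brier(probs, actual))
    return {label: (len(scores), sum(scores) / len(scores))
            for label, scores in groups.items()}
```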

Built 2026-05-30 · model 1.0.0 · calibration layer refit on the 24 months before each tournament.

2026 World Cup: live tracker

Live grading begins once matches are played. Until then, the scoreboard above already grades the model on past tournaments, and the full method is on the methodology page.

No scored matches yet

First scored match expected 2026-06-11. Once the first match is played, this page grades the model in real time: a running accuracy chart and day-by-day breakdown, updated after every game.

Nothing to show yet: the forecasts already exist on each match page, but no games have been played to test them against. For the record so far, the scoreboard above grades the model on past tournaments.

Reliability diagram

Not enough matches yet to draw this. None graded yet; it appears once at least 50 are in. Until then, the scoreboard above already shows this same chart for past tournaments, and the full backtest is on the methodology page.

Brier by competition

| Segment | Matches | Brier | Δ vs overall |
| --- | --- | --- | --- |
| World Cup 2026 | | | |

Brier by tournament stage

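| Segment | Matches | Brier | Δ vs overall |
| --- | --- | --- | --- |
| Group stage | | | |
| Round of 32 | | | |
| Round of 16 | | | |
| Quarter-final | | | |
| Semi-final | | | |
| Third-place play-off | | | |
| Final | | | |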

Brier by favourite confidence

| Segment | Matches | Brier | Δ vs overall |
| --- | --- | --- | --- |
| P_fav < 40% | | | |
| P_fav 40-60% | | | |
| P_fav 60-80% | | | |
| P_fav >= 80% | | | |
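Until live results arrive, the "Δ vs overall" column can be illustrated with the past-tournament figures from the favourite-confidence table earlier on this page. The sketch assumes Δ is the segment's Brier minus the match-weighted overall Brier; the page does not define the column, so treat that as a guess. The function names are hypothetical.

```python
# (segment, matches, Brier) copied from the past-tournament table on this page.
SEGMENTS = [
    ("P_fav < 40%", 81, 0.649),
    ("P_fav 40-60%", 476, 0.633),
    ("P_fav 60-80%", 318, 0.512),
    ("P_fav >= 80%", 112, 0.428),
]


def pooled_brier(segments):
    """Match-weighted average of the per-segment Brier scores."""
    total = sum(n for _, n, _ in segments)
    return sum(n * b for _, n, b in segments) / total


def delta_vs_overall(segments):
    """Assumed definition: segment Brier minus the pooled Brier."""
    overall = pooled_brier(segments)
    return [(name, n, b, b - overall) for name, n, b in segments]


overall = pooled_brier(SEGMENTS)
for name, n, b, delta in delta_vs_overall(SEGMENTS):
    print(f"{name:<14}{n:>5}{b:>8.3f}{delta:>+8.3f}")
print(f"{'Overall':<14}{sum(n for _, n, _ in SEGMENTS):>5}{overall:>8.3f}")
```

Under these assumptions the pooled figure rounds to 0.572, the same as the headline Brier score for the 987 backtest matches. A negative Δ means the segment scored better than the overall figure.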

What the three numbers mean

Think of a weather forecaster. Anyone can say "70% chance of rain". The good ones, when they say it, really are right about 70% of the time. These three numbers check the model the same way.

  • Brier score: how close were the probabilities to reality? For each match it measures how far the forecast was from what actually happened, and we then average across matches. A perfect crystal ball scores 0; blindly guessing 1-in-3 every time scores about 0.667. Lower is better.
  • Log loss: the same idea, but overconfidence is punished hard. Declare something near-certain and then get it wrong, and this number shoots up. It is the metric that keeps the model humble. Blind guessing scores about 1.099. Lower is better.
  • ECE: does "70%" really mean 70%? We gather all the "about 70%" forecasts and check how often those things actually happened. The average gap across every confidence level is the ECE. A gap of a few percentage points means the stated probabilities can be taken at face value. Lower is better.
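The three metrics in the list above can be written in a few lines each. This is a sketch under assumptions, not the site's scoring code: it uses the three-outcome Brier form implied by the 0.667 baseline, ten equal-width bins for ECE, and hypothetical function names.

```python
import math


def brier(probs, actual):
    """Sum of squared gaps between the forecast and what happened (0 = perfect)."""
    return sum((p - (1.0 if i == actual else 0.0)) ** 2
               for i, p in enumerate(probs))


def log_loss(probs, actual, eps=1e-15):
    """Negative log of the probability given to the actual result."""
    return -math.log(max(probs[actual], eps))  # eps guards against log(0)


def ece(forecasts, n_bins=10):
    """Expected calibration error over every (match, outcome) probability.

    forecasts: list of ((p_home, p_draw, p_away), actual_index).
    """
    bins = [[0.0, 0.0, 0] for _ in range(n_bins)]  # [sum of p, hits, count]
    for probs, actual in forecasts:
        for i, p in enumerate(probs):
            b = bins[min(int(p * n_bins), n_bins - 1)]
            b[0] += p
            b[1] += 1.0 if i == actual else 0.0
            b[2] += 1
    total = sum(b[2] for b in bins)
    # Count-weighted average of |mean predicted - observed frequency| per bin.
    return sum(abs(b[0] - b[1]) for b in bins if b[2]) / total
```

To see why log loss keeps a model humble, compare a confident miss such as `(0.98, 0.01, 0.01)` when the draw happens: `brier` returns about 1.94, close to its ceiling of 2, while `log_loss` returns about 4.61 and keeps growing as the probability given to the actual result shrinks toward zero.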

The first two reward being right and bold; the last is the honesty check. A model can look impressive and still overstate its confidence; measuring all three together catches that.

Want the machinery behind these: the component models and the out-of-sample test behind these numbers? It is all on the methodology page.

The metrics on this page are the model's own calibration, scored against observed match results: a comparison of forecast probabilities with what actually happened on the pitch.

In-tournament calibration · onthepitch