A probability publication is a credibility game. Anyone can publish numbers; the question is whether those numbers track outcomes once the matches finish. This page collects the discipline, the architecture, and the limits that determine whether the published probabilities are worth taking seriously.
The short version: the gates on every shipped change are public and reproducible, the failed ablations are published alongside the winners, the tournament-tier performance is reported separately rather than buried in a pooled number, and the parts of the model where confidence is genuinely lower are called out by name.
The discipline
Pre-registered gates. Every model change must clear an 8×90-day walk-forward conjunction gate before shipping: median Brier strictly lower than the baseline and median ECE within +0.2pp of baseline. The conjunction matters — a change that buys sharpness at the cost of calibration is not a win. The gate is implemented in scripts/backtest_models.py --folds 8 --window-days 90 and reproduces deterministically. Three things this discipline prevents:
- p-hacking through favourable folds (8 walks across 2 years; all 8 are reported, not just the best)
- shipping a "small lift" that survives one fold but not the median
- swapping calibration for sharpness without anyone noticing
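Read literally, the gate is a conjunction of two median comparisons across the eight folds. The sketch below illustrates that logic only; it is not the code in scripts/backtest_models.py. The function name gate_passes, the fold values, and the choice to express ECE as a fraction (so that +0.2pp becomes 0.002) are all assumptions made for illustration.

```python
from statistics import median

# Assumption: ECE is stored as a fraction, so +0.2 percentage points is 0.002.
ECE_TOLERANCE = 0.002


def gate_passes(baseline_brier, candidate_brier, baseline_ece, candidate_ece,
                ece_tolerance=ECE_TOLERANCE):
    """Return True only if BOTH gate conditions hold across the folds."""
    lengths = {len(baseline_brier), len(candidate_brier),
               len(baseline_ece), len(candidate_ece)}
    if len(lengths) != 1 or 0 in lengths:
        raise ValueError("all four series need the same non-zero number of folds")
    # Condition 1: median Brier strictly lower than the baseline.
    brier_ok = median(candidate_brier) < median(baseline_brier)
    # Condition 2: median ECE no more than the tolerance above the baseline.
    ece_ok = median(candidate_ece) <= median(baseline_ece) + ece_tolerance
    return brier_ok and ece_ok


# Made-up numbers for eight folds, purely for illustration.
base_brier = [0.512, 0.508, 0.515, 0.509, 0.511, 0.507, 0.513, 0.510]
cand_brier = [b - 0.002 for b in base_brier]   # sharper on every fold
base_ece = [0.020] * 8
good_ece = [0.021] * 8                         # within tolerance
bad_ece = [0.030] * 8                          # calibration traded away

ships = gate_passes(base_brier, cand_brier, base_ece, good_ece)
blocked = gate_passes(base_brier, cand_brier, base_ece, bad_ece)
```

In this toy data the second candidate is sharper on every fold and is still blocked, which is the calibration-for-sharpness swap the conjunction is meant to catch.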
Published negative results. The variants that failed the gate are published at /research/negative-results/ in full, with the gate verdict and the reason for the verdict. The current corpus includes a rest-day delta ablation, a style-matchup pair-effects model, a state-space time-varying Dixon-Coles, and per-tier Platt + shrinkage-isotonic calibrators — each tested against the same gate, each shown to fail it. Earlier ablations not yet written up as research notes (Bayesian stacking weights, a gradient-boosted meta-learner) are documented as parked artefacts in data/wc2026/; the convention is that any future no-ship arrives with its writeup. If only the winners were published, the shipping ensemble would look more inevitable than it is. The no-ships are evidence of what the corpus and the gate can and cannot distinguish.
Tier-honest reporting. The calibration page reports ECE broken out by friendly, qualifier, and tournament tier, not just pooled. The tournament-tier ECE is the worst of the three: the friendly and qualifier tiers calibrate cleanly under the per-class isotonic fit, while the tournament-tier slice is small and over-fits identity, which is why multiple stabilisation variants have been tested and rejected. Pooling masks where the model is weakest; the tier breakdown does not.
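To make the pooled-versus-tier distinction concrete, here is a minimal sketch of a per-tier ECE breakdown. It assumes one common definition of ECE (equal-width bins over predicted probability, weighted by bin size); the project's exact binning is not stated here, and the function names and toy records are hypothetical.

```python
from collections import defaultdict


def expected_calibration_error(probs, outcomes, n_bins=10):
    """Bin-size-weighted mean of |mean predicted prob - observed frequency|."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("need matching, non-empty probs and outcomes")
    bins = defaultdict(list)
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for members in bins.values():
        mean_p = sum(p for p, _ in members) / len(members)
        freq = sum(y for _, y in members) / len(members)
        ece += len(members) / total * abs(mean_p - freq)
    return ece


def ece_by_tier(records, n_bins=10):
    """records: iterable of (tier, predicted_prob, outcome in {0, 1})."""
    grouped = defaultdict(lambda: ([], []))
    for tier, p, y in records:
        grouped[tier][0].append(p)
        grouped[tier][1].append(y)
    return {tier: expected_calibration_error(ps, ys, n_bins)
            for tier, (ps, ys) in grouped.items()}


# Toy records: the friendly slice is calibrated, the tournament slice is not.
records = [
    ("friendly", 0.5, 1), ("friendly", 0.5, 0),
    ("tournament", 0.9, 1), ("tournament", 0.9, 0),
]
per_tier = ece_by_tier(records)
pooled = expected_calibration_error([r[1] for r in records],
                                    [r[2] for r in records])
```

On these toy records the pooled figure sits between the two tiers, so reporting it alone would hide how poorly the tournament slice is calibrated.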
Tournament-only audit. The matches that count for WC 2026 are tournament matches. The all-matches Brier (~0.510) is materially lower, and therefore more flattering, than the tournament-only Brier (~0.545). Both numbers are in the methodology corpus; the tournament-only audit explicitly notes that the published headline Brier "understates the difficulty of the matches the project actually has to predict in June 2026."
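For readers who want to see what the two figures measure, the sketch below computes a three-outcome Brier score on all matches and on the tournament slice alone. It assumes the form that sums squared errors over home win, draw, and away win, which is consistent with values near 0.5 but is not confirmed by the text above. The helper names and the toy matches are hypothetical.

```python
def brier_score(forecasts, outcomes):
    """Mean over matches of sum_k (p_k - o_k)^2; outcomes are class indices."""
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("need one outcome per forecast, and at least one match")
    total = 0.0
    for probs, winner in zip(forecasts, outcomes):
        total += sum((p - (1.0 if k == winner else 0.0)) ** 2
                     for k, p in enumerate(probs))
    return total / len(forecasts)


def brier_for_tier(matches, tier=None):
    """matches: list of (tier, probs, outcome_index); tier=None pools all."""
    selected = [(p, o) for t, p, o in matches if tier is None or t == tier]
    if not selected:
        raise ValueError("no matches in the requested tier")
    forecasts, outcomes = zip(*selected)
    return brier_score(forecasts, outcomes)


# Toy data: (tier, (P(home), P(draw), P(away)), index of actual result).
matches = [
    ("friendly", (0.60, 0.25, 0.15), 0),
    ("qualifier", (0.50, 0.30, 0.20), 0),
    ("tournament", (0.40, 0.30, 0.30), 2),
    ("tournament", (0.45, 0.30, 0.25), 1),
]
pooled = brier_for_tier(matches)
tournament_only = brier_for_tier(matches, "tournament")

# Relative gap between the two figures quoted in the text.
quoted_gap = (0.545 - 0.510) / 0.510   # roughly 0.069, i.e. about 7%
```

Under this definition a uniform (1/3, 1/3, 1/3) forecast scores 2/3 on every match, which gives a rough reference point for reading values around 0.51 to 0.55.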
The architecture
Ensemble of three. Production probabilities come from a uniform-weighted average of Elo, Dixon-Coles, and hierarchical Poisson, with per-class isotonic calibration. The three components disagree on which teams they trust most: Dixon-Coles fits attack and defence parameters from goal counts; hierarchical Poisson partial-pools sparse teams toward confederation means; and Elo carries the published-rating shape. The average is more robust than any single component, and the gate measures the ensemble's output, not a component in isolation.
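The averaging step can be sketched in a few lines. This is an illustration under assumptions rather than the production code: the component probabilities are invented, the calibrators are identity placeholders standing in for fitted per-class isotonic maps, and the final renormalisation is an assumption about how calibrated class probabilities are brought back to sum to one.

```python
def ensemble_average(component_probs):
    """Uniform-weighted mean of (home, draw, away) vectors, one per component."""
    n = len(component_probs)
    if n == 0:
        raise ValueError("need at least one component")
    return [sum(column) / n for column in zip(*component_probs)]


def calibrate_and_renormalise(probs, calibrators):
    """Apply one calibrator per class, then rescale so the classes sum to 1.

    Each calibrator is any callable mapping a raw probability to a calibrated
    one; a fitted isotonic map would be plugged in here (assumption).
    """
    calibrated = [f(p) for f, p in zip(calibrators, probs)]
    total = sum(calibrated)
    if total <= 0:
        raise ValueError("calibrated probabilities must have a positive sum")
    return [c / total for c in calibrated]


# Invented outputs for a single fixture, ordered (home, draw, away).
elo = (0.50, 0.27, 0.23)
dixon_coles = (0.56, 0.24, 0.20)
hier_poisson = (0.53, 0.27, 0.20)

raw = ensemble_average([elo, dixon_coles, hier_poisson])

# Identity placeholders; the real per-class isotonic fits are not shown.
identity = [lambda p: p] * 3
final = calibrate_and_renormalise(raw, identity)
```

Because each component vector already sums to one, the uniform average does too; renormalisation only matters once the per-class calibrators move the three classes independently.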
Posterior uncertainty. The hierarchical Poisson ships a full PyMC NUTS posterior alongside the point estimate. For each fixture, the 5th / 50th / 95th percentile P(home win) is computable from the posterior draws. This gives readers a credible interval around the point estimate — not just "the model says 64%" but "64% with the surrounding band the parameter uncertainty implies." The point estimate's reliability is itself measurable, not asserted.
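Given one P(home win) value per posterior draw, the band is a matter of taking percentiles. The sketch below assumes those per-draw values have already been computed; it substitutes random Beta draws as a stand-in for real posterior output and does not call PyMC. The function names and the Beta(64, 36) stand-in are hypothetical.

```python
import random


def percentile(sorted_values, q):
    """Percentile with linear interpolation; q is in [0, 100]."""
    if not sorted_values:
        raise ValueError("need at least one value")
    pos = (len(sorted_values) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] * (1 - frac) + sorted_values[upper] * frac


def credible_band(draws, lower_q=5, upper_q=95):
    """Return (lower, median, upper) percentiles of per-draw probabilities."""
    ordered = sorted(draws)
    return (percentile(ordered, lower_q),
            percentile(ordered, 50),
            percentile(ordered, upper_q))


# Stand-in for per-draw P(home win). A real pipeline would compute one value
# per posterior draw from the fitted model's parameters.
rng = random.Random(0)
draws = [rng.betavariate(64, 36) for _ in range(4000)]

low, mid, high = credible_band(draws)
```

A narrow band says the parameter uncertainty barely moves the estimate for that fixture; a wide band says the headline figure should be read more loosely.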
Methodology in the open. The model architecture, training procedure, feature lists, and backtest design are all on the methodology page. Anything the project ships, a reader can reproduce. The fit scripts in scripts/ and the backtest harness in scripts/backtest_models.py are the same code that produced the headline numbers; nothing is held back in a private repository.
Reproducible from public sources. No proprietary data feeds. Every input (international results, FIFA Elo, FBref-derived xG, ClubElo league strength, Transfermarkt valuations) is publicly available. The scripts/ directory described in the README contains the pull scripts for each ingestion path.
The limits
Calibration is not certainty. A 70%-rated outcome happens roughly 70% of the time across many similar predictions. The next individual 70% prediction can still go either way. Individual matches are not where calibration lives — the aggregate is. A reader who treats one prediction as an oracle will be surprised regardless of how good the model is.
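A small simulation shows the difference between one prediction and the aggregate. It assumes an idealised, perfectly calibrated forecaster whose 70% events really do occur with probability 0.7; the function name and seed are arbitrary.

```python
import random


def simulate_hit_rate(prob, n_predictions, seed=0):
    """Fraction of n independent events, each with probability prob, that occur."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_predictions) if rng.random() < prob)
    return hits / n_predictions


single = simulate_hit_rate(0.7, 1)        # always exactly 0.0 or 1.0
many = simulate_hit_rate(0.7, 100_000)    # settles close to 0.7
```

The single-prediction result can only ever be a hit or a miss, while the long-run frequency converges on the stated probability. That convergence is the property calibration describes.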
Tournament matches are harder than friendlies. The tournament-tier Brier is ~7% higher than the pooled Brier. Knockout games in particular have higher variance than friendlies: formation changes, manager rotation, single-elimination pressure, and fresh players who haven't played together all contribute. Two pieces of evidence support this: the tournament-only audit described above, and the consistent finding across calibrator variants that the tournament slice is the hardest tier to fit stably.
The corpus has a ceiling. Capacity-heavy variants — Bayesian stacking weights, a gradient-boosted meta-learner, a state-space time-varying Dixon-Coles, per-tier Platt and shrinkage-isotonic calibrators — have all failed the gate on the current corpus. The simple priors (composite-α offset, GK rating offset, confederation-pooled hierarchical priors) keep clearing it. The most honest reading: the model is near its capacity ceiling on the corpus available today. Further sophistication is likely to keep failing the gate until the corpus grows materially. WC 2026 will roughly double n on the tournament tier and the ablations will be rerun against the larger corpus.
Some data is missing. ~4 of 48 qualified teams are missing international xG; set-piece takers are incomplete for non-Big-5 leagues; 8 teams are missing from the team-style vectors corpus (StatsBomb coverage limit). Each gap is documented on the methodology page and on the data-sources matrix. The fits compensate via partial pooling, but the compensation is acknowledged in the relevant fit script's docstring rather than papered over.
How to read these numbers
If you treat the published probabilities as calibrated probability estimates with their stated uncertainty — useful for understanding match dynamics, comparing teams, framing storylines, or calibrating your own forecasts against an external reference — you're using the model the way it was designed to be used.
If you treat them as oracle predictions of individual outcomes, you'll be repeatedly surprised. That is not a model failure; it is the nature of probabilistic prediction. When the model says a team has a 20% chance of winning the tournament, that does not mean the team will or will not win. It describes a distribution over worlds in which the team wins and worlds in which it does not, weighted by the evidence the model has seen and produced by a process that has itself been measured.
The confidence is not in any individual probability. It is in the discipline that produced the probabilities, the public evidence that the production ensemble clears the gate, the credible intervals that surround the point estimate, and the published record of variants that did not clear the gate. The published number is the visible part; the discipline behind it is the part worth trusting.