Do teams try harder in must-win games? (No, actually)

Status: Not shipped. Gate fails on every cell; matchday-controlled analysis shows zero residual motivation effect.
Backtest date: 2026-05-27
Reproducer: scripts/backtest_motivation_residuals.py
Persisted output: data/wc2026/motivation_residuals.json
Charts: web/public/research/img/motivation-residuals-{brier,outcomes,pools,matchday}.png

Hypothesis

Football economics literature (Brams & Ismail 2018; Apesteguia & Palacios-Huerta 2010 on tournament-incentive distortions) reports that match outcomes in the final round of group-stage tournaments deviate from baseline expectations when the two sides have asymmetric or absent qualification incentives. The textbook cases:

Dead rubbers (both teams already through, or both already eliminated): coaches rotate, intensity drops, and outcomes get noisier than the model's calibrated predictions.
Asymmetric stakes (e.g. one side needs a win, the other is content with a draw): the defending side parks the bus, the attacking side rushes, and draws become more common than the baseline implies. The 1982 "Disgrace of Gijón" (West Germany 1-0 Austria, a scoreline both sides knew would send them through at Algeria's expense) is the canonical case.
Position-only stakes (already qualified, but the result determines the round-of-16 opponent): opaque incentives that may produce surprising results.

The model in production today (Elo + DC + HP ensemble) is fit on the full population of international matches — friendlies, qualifiers, group stages, knockouts. It has no representation of in-tournament context. If "hidden motivations" materially distort outcomes, we should see systematically worse Brier on identifiable subgroups of group-stage matches: dead rubbers, asymmetric-stakes matches, etc.

This note answers: do match outcomes deviate from the baseline model in identifiable incentive-state subgroups of historical major-tournament group-stage matches?

Setup

Field	Value
Today	2026-05-27
Tournament corpus	23 editions: WC 1990/1994/1998/2002/2006/2010/2014/2018/2022, Euro 2004/2008/2012/2016/2020/2024, Copa 2019/2021/2024, AFCON 2019/2021/2023, Asian Cup 2019/2023
Group-stage matches	830
Baseline	Elo-only `bracket_mc.match_probs` (the production single-model fallback) — Elo gap via logistic, 22% draw prob constant
State engine	`scripts/_motivation_state.py` — pre-kickoff group-standings reconstruction + 8-scoreline enumeration of the team's match plus all other pending in-group matches
Incentive states	`ALREADY_THROUGH`, `ELIMINATED`, `MUST_WIN`, `DRAW_ENOUGH`, `BUBBLE`, `OPEN`
Reference cell	`OPEN × OPEN` (n=238) — matchday 1 with no prior info to differentiate
Bootstrap	2000 resamples, seed 1729, for CI on per-cell Brier delta vs reference
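
As a rough illustration of the baseline row and the scoring rule, the sketch below builds an Elo-only forecast with a constant draw share and scores it with a three-class Brier. It is not the actual `bracket_mc.match_probs`: the function names, the 400-point logistic scale, and the way the non-draw mass is split are assumptions. The summed (not averaged) Brier form is also an assumption, chosen because it matches the magnitudes in this note (a uniform forecast scores about 0.667).

```python
DRAW_PROB = 0.22  # constant draw share quoted in the Setup table


def elo_match_probs(elo_home, elo_away, draw_prob=DRAW_PROB):
    """Return (P(home win), P(draw), P(away win)) from an Elo gap.

    Assumption: the logistic expectation is split over the non-draw mass.
    The production function may do this differently.
    """
    expected_home = 1.0 / (1.0 + 10.0 ** (-(elo_home - elo_away) / 400.0))
    decisive = 1.0 - draw_prob
    return (decisive * expected_home, draw_prob, decisive * (1.0 - expected_home))


def brier(probs, outcome):
    """Three-class Brier score: sum of squared errors over H/D/A."""
    target = {"H": (1, 0, 0), "D": (0, 1, 0), "A": (0, 0, 1)}[outcome]
    return sum((p - t) ** 2 for p, t in zip(probs, target))
```

For two equally rated sides this gives (0.39, 0.22, 0.39), and a home win then scores 0.61² + 0.22² + 0.39² = 0.5726, close to the reference-cell mean reported below.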

Each side's incentive state is computed independently from group standings at kickoff, with the FIFA simultaneous-kickoff convention enforced: matches on the same calendar date are treated as parallel kickoffs, so a same-day match is counted neither among the played results nor among this match's antecedents. Tiebreakers: points → goal difference → goals scored (head-to-head is not modelled; it is rarely the decisive tiebreaker in the corpus). The classifier is unit-tested against synthetic group fixtures plus the canonical Gijón 1982 case (tests/test_motivation_state.py).
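
The enumeration idea can be sketched in a few lines. This is a simplified stand-in for `scripts/_motivation_state.py`, not its actual code: it uses three representative scorelines per match instead of eight, assumes three points for a win and two qualifiers per group, and the state definitions (for example, `MUST_WIN` meaning "advances in some scenario with a win and in none without one") are assumptions made for illustration. Teams still level after all three tiebreakers keep their input order.

```python
from itertools import product

# One representative scoreline per result (assumption; the note's engine uses 8).
SCORELINES = [(1, 0), (0, 0), (0, 1)]


def standings(results, teams):
    """Rank teams by points, then goal difference, then goals scored."""
    table = {t: [0, 0, 0] for t in teams}  # points, goal difference, goals for
    for home, away, hg, ag in results:
        table[home][1] += hg - ag
        table[away][1] += ag - hg
        table[home][2] += hg
        table[away][2] += ag
        if hg > ag:
            table[home][0] += 3
        elif hg < ag:
            table[away][0] += 3
        else:
            table[home][0] += 1
            table[away][0] += 1
    return sorted(teams, key=lambda t: tuple(table[t]), reverse=True)


def incentive_state(team, match, played, pending, teams, n_advance=2):
    """Classify `team` before `match`; `pending` holds the other unplayed matches."""
    if not played:
        return "OPEN"
    qualifies = {"W": [], "D": [], "L": []}  # advancement flags per own result
    for own_score in SCORELINES:
        hg, ag = own_score
        mine, theirs = (hg, ag) if match[0] == team else (ag, hg)
        key = "W" if mine > theirs else "D" if mine == theirs else "L"
        for other_scores in product(SCORELINES, repeat=len(pending)):
            results = list(played) + [(match[0], match[1], hg, ag)]
            results += [(h, a, *s) for (h, a), s in zip(pending, other_scores)]
            rank = standings(results, teams).index(team)
            qualifies[key].append(rank < n_advance)
    every = [flag for flags in qualifies.values() for flag in flags]
    if all(every):
        return "ALREADY_THROUGH"
    if not any(every):
        return "ELIMINATED"
    if all(qualifies["W"]) and all(qualifies["D"]):
        return "DRAW_ENOUGH"
    if any(qualifies["W"]) and not any(qualifies["D"]) and not any(qualifies["L"]):
        return "MUST_WIN"
    return "BUBBLE"
```

With a synthetic group where A has six points, B and C three each (B ahead on goal difference) and D none, the final-round pairing B vs C comes out as `DRAW_ENOUGH` for B and `MUST_WIN` for C under these assumed definitions.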

Result — per-cell analysis (original approach)

State pair	n	Brier	Δ vs OPEN	95% CI	obs H/D/A	pred H/D/A	gate
open × open	238	0.5540	—	—	0.51/0.25/0.24	0.53/0.22/0.25	reference
bubble × bubble	365	0.5979	+0.044	[−0.027, +0.107]	0.40/0.28/0.32	0.42/0.22/0.36	—
bubble × draw_enough	59	0.6397	+0.086	[−0.034, +0.203]	0.29/0.27/0.44	0.33/0.22/0.45	—
already_through × bubble	29	0.6313	+0.077	[−0.116, +0.294]	0.28/0.17/0.55	0.29/0.21/0.50	—
bubble × eliminated	29	0.4746	−0.079	[−0.209, +0.071]	0.31/0.14/0.55	0.37/0.22/0.41	—
draw_enough × must_win	23	0.5818	+0.028	[−0.122, +0.176]	0.26/0.30/0.43	0.34/0.22/0.45	—
already_through × draw_enough	21	0.6517	+0.098	[−0.095, +0.309]	0.38/0.38/0.24	0.38/0.22/0.40	—
draw_enough × eliminated	17	0.6925	+0.139	[−0.079, +0.359]	0.35/0.29/0.35	0.40/0.21/0.38	—
already_through × already_through	15	0.5583	+0.004	[−0.150, +0.164]	0.47/0.27/0.27	0.44/0.22/0.34	—
eliminated × eliminated	14	0.4573	−0.097	[−0.231, +0.045]	0.57/0.07/0.36	0.40/0.22/0.38	—
already_through × eliminated	12	0.4581	−0.096	[−0.326, +0.148]	0.33/0.17/0.50	0.40/0.22/0.39	—

Per-cell mean Brier vs OPEN×OPEN reference, with bar colour encoding CI-on-Δ status.

Predicted vs observed H/D/A for OPEN reference and the draw_enough × must_win focus cell.

No cell with n ≥ 30 has a 95% CI that excludes zero. The per-cell approach fragments 830 matches into 14 cells, leaving most below the n ≥ 30 gate. The original 14-edition analysis (494 matches) produced the same conclusion; growing the corpus by about two thirds did not change any cell's verdict.
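
A minimal sketch of the per-cell test, assuming a percentile bootstrap that resamples the cell and the reference independently; the backtest script may resample differently. The helper names are hypothetical; the 2000 resamples, the seed, and the gate thresholds (n ≥ 30, |Δ| ≥ 0.02, CI excluding zero) come from this note.

```python
import random
from statistics import fmean


def bootstrap_delta_ci(cell, reference, n_resamples=2000, seed=1729, alpha=0.05):
    """Percentile CI for mean(cell) - mean(reference) over per-match Brier scores."""
    rng = random.Random(seed)
    deltas = []
    for _ in range(n_resamples):
        cell_mean = fmean(rng.choices(cell, k=len(cell)))
        ref_mean = fmean(rng.choices(reference, k=len(reference)))
        deltas.append(cell_mean - ref_mean)
    deltas.sort()
    lo = deltas[int((alpha / 2) * n_resamples)]
    hi = deltas[int((1 - alpha / 2) * n_resamples) - 1]
    return fmean(cell) - fmean(reference), (lo, hi)


def gate_passes(n, delta, ci, min_n=30, min_delta=0.02):
    """Apply the three per-cell gate criteria from this note."""
    lo, hi = ci
    excludes_zero = lo > 0 or hi < 0
    return n >= min_n and abs(delta) >= min_delta and excludes_zero
```

Plugging in the reported bubble × bubble values, `gate_passes(365, 0.044, (-0.027, 0.107))` is `False`: the sample is large enough and the delta clears 0.02, but the interval straddles zero.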

Result — pooled mechanistic analysis (new)

To address the cell-fragmentation problem, we pooled state pairs into five mechanistic buckets:

Pool	Definition	n	Brier	Δ vs ref	95% CI	obs draw%	pred draw%
Reference	OPEN × OPEN (matchday 1)	238	0.554	—	—	24.8%	21.8%
Standard competition	BUBBLE × BUBBLE	365	0.598	+0.044	[−0.027, +0.107]	28.2%	21.9%
Draw defending	Any cell with DRAW_ENOUGH	123	0.633	+0.079	[−0.009, +0.170]	29.3%	21.8%
Dead rubber	Both sides in {AT, ELIM}	41	0.495	−0.060	[−0.170, +0.055]	17.1%	21.9%
Asymmetric other	One settled + one fighting (no DE)	63	0.572	+0.018	[−0.110, +0.153]	15.9%	21.7%
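
The pooling rules in the table can be written as a small lookup. This is a sketch based only on the Definition column; the function name, the check order, and the `"unpooled"` fallback for pairs the table does not describe (for example `MUST_WIN` × `BUBBLE`) are assumptions.

```python
SETTLED = {"ALREADY_THROUGH", "ELIMINATED"}


def pool(state_a, state_b):
    """Map an unordered pair of incentive states to a mechanistic pool."""
    pair = {state_a, state_b}
    if pair == {"OPEN"}:
        return "reference"
    if "DRAW_ENOUGH" in pair:      # any cell with DRAW_ENOUGH
        return "draw_defending"
    if pair <= SETTLED:            # both sides through or eliminated
        return "dead_rubber"
    if pair & SETTLED:             # one settled side, one still fighting
        return "asymmetric_other"
    if pair == {"BUBBLE"}:
        return "standard_competition"
    return "unpooled"              # hypothetical fallback, not in the table
```

Checking `DRAW_ENOUGH` before the settled states matters: `already_through × draw_enough` belongs to the draw-defending pool, not to asymmetric other.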

Pooled motivation analysis — mechanistic buckets.

The draw_defending pool (n=123) is the strongest candidate: Brier +0.079 worse than reference, CI [−0.009, +0.170] — nearly excludes zero. The draw rate is elevated (29.3% observed vs 21.8% predicted), consistent with the "park the bus" hypothesis. But the standard_competition pool (BUBBLE×BUBBLE, n=365) also shows elevated draw rates (28.2% vs 21.9%) and worse Brier (+0.044), raising the question: is the draw_defending signal really about motivation, or is it the general matchday effect?

Result — matchday ablation

To disentangle motivation from the matchday confounder, we compared Brier by matchday, ignoring motivation states entirely:

Matchday	n	Brier	Δ vs MD1	95% CI
1	274	0.549	—	—
2	274	0.596	+0.046	[−0.025, +0.112]
3	276	0.607	+0.058	[−0.013, +0.126]

Matchday ablation — Brier by matchday.

Brier rises monotonically with matchday in the point estimates: matchday 3 is 0.058 worse than matchday 1 across all matches, although the CI [−0.013, +0.126] still includes zero. That gap is about three quarters of the raw draw_defending signal (+0.079), and the draw_defending pool consists almost entirely of later-matchday matches (120 of its 123 are on matchday 3), which are harder to predict regardless of motivation.

Result — matchday-controlled test (the key finding)

The definitive test: compare draw_defending vs standard_competition on matchday 3 only, holding the matchday confounder constant:

Pool (MD3 only)	n	Brier	obs draw%	pred draw%
Standard competition	52	0.645	28.8%	22.0%
Draw defending	120	0.642	30.0%	21.8%
Dead rubber	39	0.509	17.9%	22.0%
Asymmetric other	61	0.558	14.8%	21.7%

Δ Brier (draw_defending − standard) = −0.003, 95% CI [−0.138, +0.130]. Once matchday is controlled, the draw_defending pool has no additional Brier worsening beyond the standard competition pool.

Δ Brier (dead_rubber − standard) = −0.136, 95% CI [−0.285, +0.024]. Dead rubbers score notably better than standard competition at MD3, though the CI narrowly includes zero.
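
The matchday-controlled comparison amounts to filtering to one matchday before taking pool means. A minimal sketch, assuming each match is a dict with hypothetical `matchday`, `pool`, and `brier` keys (the persisted CSV may use other column names):

```python
from collections import defaultdict
from statistics import fmean


def brier_by_pool(rows, matchday):
    """Mean Brier per pool, restricted to a single matchday."""
    groups = defaultdict(list)
    for row in rows:
        if row["matchday"] == matchday:
            groups[row["pool"]].append(row["brier"])
    return {name: fmean(scores) for name, scores in groups.items()}


def controlled_delta(rows, pool, baseline="standard_competition", matchday=3):
    """Brier gap between `pool` and `baseline` with matchday held fixed."""
    means = brier_by_pool(rows, matchday)
    return means[pool] - means[baseline]
```

Because both means come from the same matchday, any general "later matches are harder" effect cancels out of the difference.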

Result — draw-rate excess test

Direct test of whether draw_defending matches show more draws than expected (excess = observed − predicted draw rate), compared to the reference pool:

Pool	n	Observed draw%	Predicted draw%	Draw excess
Draw defending	123	29.3%	21.8%	+0.074
Reference	238	24.8%	21.8%	+0.030

Δ excess = +0.044, 95% CI [−0.057, +0.144]. The draw rate is slightly higher in draw_defending, but the difference from the reference pool does not exclude zero. Both pools show excess draws relative to the model's fixed 22% draw prior.
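
The draw-excess quantity is simple enough to state in code. A sketch with hypothetical function names, assuming each match is a `(p_draw, outcome)` pair with outcome in `"H"`, `"D"`, `"A"`:

```python
from statistics import fmean


def draw_excess(matches):
    """Observed draw rate minus mean predicted draw probability."""
    observed = fmean(1.0 if outcome == "D" else 0.0 for _, outcome in matches)
    predicted = fmean(p_draw for p_draw, _ in matches)
    return observed - predicted


def excess_gap(pool, reference):
    """Difference in draw excess between a pool and the reference pool."""
    return draw_excess(pool) - draw_excess(reference)
```

A CI on `excess_gap` can then be obtained by bootstrapping the two pools, in the same way as for the Brier deltas.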

What the results say

No motivation effect survives matchday control. Most of the apparent draw_defending signal (+0.079 Brier vs reference) is matched by the matchday confounder (+0.058 for MD3 vs MD1), and at matchday 3 draw_defending and standard_competition have essentially identical Brier (0.642 vs 0.645; Δ = −0.003, CI [−0.138, +0.130]). Within the precision of this corpus, the motivation classification adds no detectable predictive information beyond "this is a later matchday."
The matchday trend is not about motivation. Later group-stage matches look harder to predict: Brier worsens monotonically from matchday 1 (0.549) to matchday 3 (0.607), although the MD3 vs MD1 CI [−0.013, +0.126] includes zero. Plausible mechanisms: tactical adjustments, form updates, and standings interactions (not modelled by the pre-tournament Elo snapshot) add uncertainty that the matchday-1 baseline doesn't face.
Dead rubbers look easier to predict, not harder. At matchday 3, dead_rubber Brier (0.509) is 0.136 better than standard_competition (0.645), CI [−0.285, +0.024], which narrowly includes zero. A plausible reading is that when both teams have nothing to play for, variance drops and the Elo prediction (which tends toward the prior) fits better. This is the opposite of the "dead rubbers are noisy" hypothesis from the literature.
The Elo baseline's fixed 22% draw prior is miscalibrated for tournament group stages. The three largest pools show observed draw rates above 22% (reference 24.8%, standard_competition 28.2%, draw_defending 29.3%), while dead_rubber (17.1%) and asymmetric_other (15.9%) fall below it. Across the corpus as a whole the model underpredicts draws, so raising the prior would reduce Brier in the three largest pools. This is a calibration issue, not a motivation-specific one.
The earlier 14-edition finding held up. Growing the corpus from 494 to 830 matches (23 editions, 1990–2024) reinforced the original negative result on the per-cell gate. The draw_enough × must_win cell's apparent 46% draw rate at n=13 (the most suggestive finding in the original analysis) fell to 30% at n=23, a classic small-sample regression to the mean.

Decision

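Gate (the first three criteria were defined in advance; the matchday-controlled criterion, marked "new", was added for this analysis):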

Criterion	Required	Observed (any cell/pool)	Pass?
Some non-OPEN cell/pool with n ≥ 30	yes	draw_defending (123), standard_competition (365)	yes
Δ Brier on that cell ≥ 0.02 in absolute value	yes	draw_defending Δ=+0.079	yes
95% bootstrap CI on the Δ excludes zero	yes	draw_defending CI [−0.009, +0.170]	no
Matchday-controlled Δ ≥ 0.02	yes (new)	draw_defending vs standard at MD3: Δ=−0.003	no

Verdict: do not ship a predict-time motivation adjustment. The per-cell analysis fails the CI gate. The draw_defending pool comes close, but its CI [−0.009, +0.170] still includes zero. The matchday-controlled test, the strongest one available, shows no residual motivation effect once the matchday confounder is held fixed (Δ = −0.003).

What this backtest can't tell you

The DC + HP + Elo ensemble's residuals on these same cells. This note uses the Elo-only baseline because a per-edition DC refit takes ~40 min. The ensemble is the production model; its residuals could differ.
Whether a matchday-aware baseline changes the picture. The Elo baseline treats all matches as identical. A model that knows "this is matchday 3 of a group stage" might already absorb the matchday effect, making any residual motivation signal more visible — or confirming that nothing remains.
Asymmetry direction. State pairs are sorted alphabetically; a directional analysis (does the must_win team specifically underperform when away?) would require a further categorical split, shrinking the samples even more.
Best-third qualification finesse. For 24-team formats, the classifier uses within-group standings only and doesn't simulate other groups' "best 3rd" qualification routes.

Shipped fix — group-stage draw calibration

Update — re-swept after extremization. The factor below (1.30) is the pre-extremization sweep. After the d = 1.15 ensemble extremization landed — which sharpens the blend and lifts raw group-stage P(draw) to ~24.7% — the factor was re-swept and lowered to 1.05, the value now in bracket_mc.py and methodology.md. The 1.30 sweep and the backtest table below remain the record for the pre-extremization corpus; because the Brier response is a flat plateau either way, the qualitative conclusions (a small multiplicative draw nudge helps; the per-matchday refinement does not) are unchanged.

Two changes shipped to bracket_mc.py based on this research:

1. Calibrator bypass for group stage. The tournament-tier calibrator's D isotonic curve pools group-stage matches (~26% draws) with knockout matches (~0% recorded draws, post-ET/penalty results). This suppresses group-stage draw predictions: raw D=0.22 maps to calibrated D=0.16. The fix: group-stage matches use raw (uncalibrated) ensemble probs while knockout matches keep the fully calibrated probs.

2. Draw scaling factor. The raw ensemble still underpredicts group-stage draws (mean 22.2% vs empirical 26.8%). A Brier-minimising sweep on the 830-match historical corpus found the optimal multiplicative factor is 1.30 (flat Brier plateau from 1.20–1.40, so the choice is robust). Applied to raw D in the group-stage table, renormalized.
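
A minimal sketch of the scaling step and the sweep, under stated assumptions: "renormalized" is taken to mean dividing all three probabilities by their new sum, Brier is the summed three-class form, and `scale_draw`, `mean_brier`, and `sweep_draw_factor` are hypothetical names, not functions from `bracket_mc.py`.

```python
GROUP_STAGE_DRAW_FACTOR = 1.30  # pre-extremization value quoted in this note
TARGETS = {"H": (1, 0, 0), "D": (0, 1, 0), "A": (0, 0, 1)}


def scale_draw(p_home, p_draw, p_away, factor=GROUP_STAGE_DRAW_FACTOR):
    """Multiply P(draw) by `factor`, then renormalize so the triple sums to 1."""
    scaled_draw = p_draw * factor
    total = p_home + scaled_draw + p_away
    return (p_home / total, scaled_draw / total, p_away / total)


def mean_brier(matches, factor):
    """Mean three-class Brier over (probs, outcome) pairs after draw scaling."""
    total = 0.0
    for probs, outcome in matches:
        scaled = scale_draw(*probs, factor=factor)
        total += sum((p - t) ** 2 for p, t in zip(scaled, TARGETS[outcome]))
    return total / len(matches)


def sweep_draw_factor(matches, factors):
    """Return the candidate factor with the lowest mean Brier."""
    return min(factors, key=lambda f: mean_brier(matches, f))
```

Under this renormalization the home-to-away ratio is preserved, so the factor only moves mass between "draw" and "decisive result". A call such as `sweep_draw_factor(matches, [1.0, 1.1, 1.2, 1.3, 1.4, 1.5])` on the historical corpus would correspond to the sweep described above.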

Backtest results (830 group-stage matches, 1990–2024):

Model	Mean D%	Brier	Δ vs old
Old calibrated (shipped pre-fix)	16.3%	0.5954	—
Raw ensemble (bypass only)	22.2%	0.5834	−0.012 (−2.0%)
Raw × 1.30 (shipped fix)	26.6%	0.5795	−0.016 (−2.7%)
Theoretical optimal (sweep)	26.8%	0.5804	−0.015 (−2.5%)

The draw scaling at 1.30 slightly outperforms the sweep-optimal because the multiplicative scaling concentrates its adjustment on matches where the raw D is furthest from the empirical rate (lopsided matches), whereas the sweep used a flat DRAW_PROB constant.

Impact on WC 2026 probabilities (20k sims): strong favorites' group-win rates mostly decrease (Argentina 87→77%; France is an exception at 60→62%), while underdogs gain substantially (USA advance 53→73%, South Africa 2→26%, DR Congo 7→31%). Overall advancement rates decrease slightly because shared points make advancement harder. Tournament winner probabilities shift modestly (Spain −5pp, Brazil +2pp, France +2pp, Germany +2pp).

Follow-up: per-matchday + dead-rubber draw factors

Status: Not shipped. The bootstrap CI on the improvement includes zero; the added complexity is not justified.

Extended the flat GROUP_STAGE_DRAW_FACTOR = 1.30 to per-matchday factors with a dead-rubber carve-out. Sweep on the same 830-match corpus:

Strategy	Draw factors	Brier	Δ vs flat 1.30
Flat (shipped)	all=1.30	0.5804	—
Per-matchday	MD1=1.13, MD2=1.33, MD3=1.24	0.5800	−0.0004
Per-MD + dead rubber	MD1=1.13, MD2=1.33, MD3=1.24, DR=0.81	0.5794	−0.0010

Dead-rubber characteristics (n=39): draw rate 17.9% (vs 26.8% overall), favourites win 64% (vs predicted 56%), underdogs win 18%.

Bootstrap CI (2000 resamples, seed 1729) for per-MD+DR vs flat: [−0.0032, +0.0012] — does not exclude zero. The improvement is 0.17% of base Brier, well within noise.
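
Since both strategies are scored on the same 830 matches, one natural way to get such an interval is a paired bootstrap over per-match Brier differences. The sketch below assumes that design; whether the backtest script pairs its resamples is not stated here, and the function name is hypothetical.

```python
import random
from statistics import fmean


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, seed=1729, alpha=0.05):
    """Percentile CI for mean(scores_a - scores_b), resampling matches jointly."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both models must be scored on the same matches")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)
    means = sorted(
        fmean(rng.choices(diffs, k=len(diffs))) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return fmean(diffs), (lo, hi)
```

Pairing removes the match-to-match variance that both strategies share, which is what makes differences as small as 0.001 testable at all.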

Per-matchday draw rates: MD1=24.1%, MD2=28.8%, MD3=25.0%. MD2 having the highest draw rate (not MD3) is mildly surprising — possibly reflects both teams being cautious after establishing positions on MD1.

Decision: not shipped. The flat factor=1.30 captures the bulk of the draw miscalibration fix (−2.7% Brier). Per-matchday refinements add simulation complexity (matchday tracking, dead-rubber detection mid-sim) for a signal indistinguishable from zero at n=830. Revisit if corpus grows past ~2000 group-stage matches.

Follow-ups worth considering

~~Fix the draw prior.~~ Done — two-stage fix: calibrator bypass + multiplicative scaling (GROUP_STAGE_DRAW_FACTOR = 1.30). Captures 100%+ of the available Brier improvement vs the sweep-optimal constant.
~~Matchday-aware baseline.~~ Investigated, not shipped — see above. Per-matchday draw factors improve Brier by only 0.0004; adding dead-rubber detection reaches 0.0010 but CI includes zero.
~~Dead-rubber calibrator.~~ Investigated, not shipped — see above. n=39 dead rubbers; directional signal (low draws, favourite overperformance) but too small to act on.

Files touched

scripts/_motivation_state.py — incentive-state classification engine.
scripts/build_motivation_states.py — joiner: YAML + intl results → states CSV.
scripts/build_motivation_baselines.py — Elo baseline per match.
scripts/backtest_motivation_residuals.py — residual analysis + gate + pooled/draw/matchday analyses.
scripts/plot_motivation_residuals.py — four charts.
tests/test_motivation_state.py — 13 tests covering arithmetic, the five classification states, OPEN shortcut, Gijón 1982.
data/wc2026/tournament_groups.yml — 23 editions (1990–2024; expanded from original 14).
scripts/validate_tournament_groups.py — sanity-checks the YAML against the CSV.
data/wc2026/motivation_states.csv — persisted output (gitignored).
data/wc2026/motivation_baselines.csv — persisted output (gitignored).
data/wc2026/motivation_residuals.json — persisted output (gitignored).
web/public/research/img/motivation-residuals-{brier,outcomes,pools,matchday}.png — published charts.
documentation/research-notes/hidden-motivations.md — this file.
web/public/research/notes/hidden-motivations.md — mirror.
scripts/bracket_mc.py — group-stage draw calibration bypass + scaling factor.
tests/test_bracket_mc.py — test for group-stage raw draw probs.

Reproducing

.venv/bin/python scripts/validate_tournament_groups.py
.venv/bin/python scripts/build_motivation_states.py
.venv/bin/python scripts/build_motivation_baselines.py
.venv/bin/python scripts/backtest_motivation_residuals.py
.venv/bin/python scripts/plot_motivation_residuals.py

Requires data/raw/intl/results.csv (from python scripts/pull_intl_results.py) and data/wc2026/intl_elo_history.csv (from python scripts/build_intl_elo_history.py).
