Research note

Is composite *coverage* the lever for the player-strength offset? (No)

Status: Not shipped. Negative result (2026-05-29). Broadening the player-composite's match coverage — whether honestly (point-in-time WC composite) or optimistically (a contaminated full-coverage proxy) — does not beat the shipped hand-set α = 0.05 current-composite offset (Model 16). No production artefact was changed.

Backtest date: 2026-05-29
Reproducer: scripts/backtest_pit_composite_offset.py --today 2026-05-29
Build: scripts/build_pit_team_composite.py
Tests: tests/test_build_pit_team_composite.py
Sidecars (gitignored): data/wc2026/pit_team_composite.csv, data/wc2026/pit_composite_offset_backtest.json
Follows: documentation/research-notes/player-strength-fitted-alpha.md

Hypothesis under test

The fitted-α note (2026-05-29) closed with: "the real lever is coverage, not coefficient form: the offset will only ever touch WC-vs-WC fixtures until the player composite is extended beyond the 48 qualifiers." The player composite (data/wc2026/team_composite_sum.csv) covers only the 48 WC2026 qualifiers, so the Model-16 offset Δ = α·(comp_home − comp_away) fires on just 228 of 1 926 holdout matches (11.8%), with one walk contributing only n=2 such matches. The claim: more coverage would let the offset (and/or a fitted α) actually improve Brier.

This note tests that claim. Verdict: coverage is not the lever. Even at 100% coverage (with an optimistic, contaminated proxy) the offset does not beat the incumbent, and a real point-in-time composite at the same ~11% coverage is worse than the shipped fixed-roster composite.
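
For orientation, the offset mechanism being tested can be sketched as follows. This is a minimal illustration, not the repository's code: the function names and the toy composite values are hypothetical, and it assumes the offset simply does not fire when either team is missing from the composite.

```python
from typing import Mapping, Optional, Sequence, Tuple

def composite_offset(home: str, away: str,
                     composite: Mapping[str, float],
                     alpha: float = 0.05) -> Optional[float]:
    """Return alpha * (comp_home - comp_away), or None if either team is uncovered."""
    if home not in composite or away not in composite:
        return None  # offset does not fire for this fixture
    return alpha * (composite[home] - composite[away])

def coverage(fixtures: Sequence[Tuple[str, str]],
             composite: Mapping[str, float]) -> float:
    """Fraction of fixtures on which the offset fires (both teams covered)."""
    fired = sum(1 for h, a in fixtures
                if composite_offset(h, a, composite) is not None)
    return fired / len(fixtures)

# Toy data (made up): only two teams carry a composite.
toy_composite = {"Brazil": 9.0, "Japan": 7.0}
toy_fixtures = [("Brazil", "Japan"), ("Brazil", "Jamaica"),
                ("Italy", "Nigeria"), ("Japan", "Brazil")]

print(composite_offset("Brazil", "Japan", toy_composite))
print(coverage(toy_fixtures, toy_composite))
```

In this toy set the offset fires on half the fixtures; in the real holdout the equivalent share is the 11.8% quoted above.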

The hard correctness constraint: point-in-time

An honest backtest must use squad strength as of each match date. A 2024 match must use ~2024 ratings, not 2026. FBref player season tables are already historical, so per-season player ratings exist; the hard part is knowing which players were in each nation's squad at each historical date.

Feasibility finding 1 — broadening to non-WC nations is not tractable on-disk

The entire player infrastructure is scoped to the 48 WC2026 qualifiers:

| Asset | Scope |
| --- | --- |
| players.csv (2 627 rows) | 48 WC countries only |
| ratings_player.csv (873 players, seasons 2010–2026) | only players who appear in a WC squad |
| intl_career_timeline.csv (1 563 players, 48 teams) | career histories of current WC-squad players only |

The holdout has 231 distinct teams; 183 of them are non-WC (Jamaica, Costa Rica, Nigeria, Italy, …). To give those teams a composite we'd need (a) their historical rosters per date and (b) per-season ratings for those players. No on-disk source carries either. FBref via soccerdata exposes INT-World Cup / INT-European Championship tournament rosters but not the full qualifier/friendly roster history for ~100 nations. Reconstructing point-in-time squads for a broadened nation set is a substantial new data-acquisition effort, out of scope for this experiment. Per the task's Step 3 we did not fake it; instead we ran the two arms below.

Feasibility finding 2 — survivorship bias in the only historical-roster source

intl_career_timeline.csv is built from the career histories of the current 48 squads, so for any past year it captures only those current-squad players who were already active then. Mean reconstructed squad size (a real squad is ~23–26):

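| Year | 2014 | 2016 | 2018 | 2020 | 2022 | 2024 | 2025 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| reconstructed size | 2.8 | 5.1 | 8.9 | 10.4 | 17.5 | 24.0 | 26.7 |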

This is pure survivorship: a PIT composite from this source is honest only for ~2023 onward (the 2026 roster ≈ the 2024–2025 active roster) and increasingly contaminated before that. The current backtest holdout is entirely 2024-06 → 2026-05, so within that window the PIT composite is a defensible "strength as of match year" — which is exactly why Arm A below is labelled REAL rather than contaminated. (Restricted to rated players, the active pool is thinner still — ~13–14 in 2024–2025 — so the top-11 selection often normalises up from a sub-11 pool. Honest, but noisier than the 2026 projected XI.)
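
The survivorship effect can be reproduced in miniature. The sketch below assumes a hypothetical timeline layout of (player, team, first active year, last active year); the real column layout of intl_career_timeline.csv is not shown in this note, and the rows are invented.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (player, team, first_active_year, last_active_year).
# Every player is drawn from a *current* squad, so all careers run to 2026.
TIMELINE = [
    ("p1", "A", 2014, 2026),
    ("p2", "A", 2020, 2026),
    ("p3", "A", 2024, 2026),
    ("p4", "B", 2018, 2026),
    ("p5", "B", 2024, 2026),
]

def mean_squad_size(timeline, year):
    """Mean number of players per team who were already active in `year`."""
    teams = {team for _, team, _, _ in timeline}
    counts = defaultdict(int)
    for _, team, first, last in timeline:
        if first <= year <= last:
            counts[team] += 1
    # Teams with no active player in `year` contribute a squad size of 0.
    return mean(counts[t] for t in teams)

for y in (2014, 2020, 2024):
    print(y, mean_squad_size(TIMELINE, y))
```

Because retired or dropped players never enter the source, the reconstructed squad can only shrink as the year moves back, which is the pattern in the table above.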

Coverage-ceiling table (analytic, no contamination)

How the both-teams-covered fixture fraction grows as the covered-nation set expands, holding the 8-walk holdout fixed (top-N teams by holdout appearances):

| covered nations | both-covered fixtures | % of holdout |
| --- | --- | --- |
| 48 (current WC set) | 228 | 11.6% |
| 80 | 604 | 30.7% |
| 100 | 789 | 40.1% |
| 150 | 1 382 | 70.3% |
| 231 (all) | 1 966 | 100.0% |

So coverage could rise from 11.6% to 40–70% with ~100–150 nations — if a composite of comparable quality existed for them. The two arms below test whether that would actually help.
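
The top-N rows of the coverage-ceiling table follow from a simple count. A minimal sketch, assuming fixtures are (home, away) pairs and that "top-N" means the N teams with the most holdout appearances; the function name and toy fixtures are illustrative only. (The 48-nation row uses the actual WC set rather than a top-48 ranking.)

```python
from collections import Counter

def both_covered_fraction(fixtures, n_teams):
    """Share of fixtures where both sides are among the top-N teams by appearances."""
    appearances = Counter(team for fixture in fixtures for team in fixture)
    covered = {team for team, _ in appearances.most_common(n_teams)}
    hits = sum(1 for home, away in fixtures
               if home in covered and away in covered)
    return hits / len(fixtures)

# Toy holdout: appearances are A=3, B=3, C=2, D=1, E=1.
toy = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "D"), ("B", "E")]

for n in (2, 3, 5):
    print(n, both_covered_fraction(toy, n))
```

The fraction grows faster than linearly in N at first because a fixture only counts once both of its teams are covered.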

Method

Same 8-walk × 90-day walk-forward harness, same per-walk DC + HP refit, same DC+HP uniform-mean target and conjunction gate as the incumbent scripts/backtest_composite_offset.py. Three offset policies, evaluated on identical fits + eval rows:

  • CURRENT (incumbent surface) — the shipped fixed-2026-roster composite, keyed by team name. WC-vs-WC only.
  • ARM A — PIT (REAL, honest in-window) — a point-in-time composite keyed by (team, match-year) from build_pit_team_composite.py (active-in-year roster × per-season player rating). Same ~11% WC-vs-WC coverage; tests whether temporal accuracy of the composite helps.
  • ARM B — SYNTHETIC (SENSITIVITY, CONTAMINATED — not ship-eligible) — a proxy composite for every team derived from the pre-cutoff DC attack−defence strength, rescaled to the WC-composite distribution, so the offset fires on 100% of holdout matches. This is contaminated by construction (the proxy is read off the same fit it then helps predict). It exists only to answer "does the offset mechanism scale with coverage at all?".
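
To make the Arm B construction concrete, here is one way such a proxy could be built. It is a sketch under assumptions: "rescaled to the WC-composite distribution" is read here as matching the mean and standard deviation of the real composite, which the note does not spell out, and all names and numbers are hypothetical.

```python
from statistics import mean, pstdev

def synthetic_composite(dc_attack, dc_defence, wc_composite):
    """Map DC (attack - defence) onto the mean/std of the real WC composite.

    Assumption: 'rescaled' means a linear z-score transform; the actual
    script may use a different mapping.
    """
    raw = {t: dc_attack[t] - dc_defence[t] for t in dc_attack}
    raw_mu, raw_sd = mean(raw.values()), pstdev(raw.values())
    tgt_mu, tgt_sd = mean(wc_composite.values()), pstdev(wc_composite.values())
    return {t: tgt_mu + tgt_sd * (v - raw_mu) / raw_sd for t, v in raw.items()}

# Toy inputs (made up): four teams with DC parameters, two with a real composite.
dc_attack = {"A": 0.4, "B": 0.1, "C": -0.2, "D": -0.3}
dc_defence = {"A": -0.2, "B": 0.0, "C": 0.1, "D": 0.3}
wc_composite = {"A": 9.0, "B": 7.0}

proxy = synthetic_composite(dc_attack, dc_defence, wc_composite)
print(proxy)
```

Every team with DC parameters receives a value, which is what pushes coverage to 100%. The proxy is also a deterministic function of the fit it is later added to, which is the contamination the label refers to.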

Harness fidelity check

CURRENT reproduces the incumbent baselines bit-for-bit: α=0 → 0.514023 and α=0.05 → 0.511581, identical to player-strength-fitted-alpha.md. No methodology drift; the three-way comparison is apples-to-apples.

Results — median Brier (DC+HP uniform mean, 8 walks)

The ship gate for this experiment: median Brier strictly below the incumbent CURRENT α=0.05 (0.511581) AND ECE within +0.2pp of it (≤ 8.01pp).

| Policy | coverage | best α | median Brier | median ECE | beats incumbent? |
| --- | --- | --- | --- | --- | --- |
| CURRENT α=0.05 (incumbent) | 11.8% | 0.05 | 0.511581 | 7.81pp | — (reference) |
| CURRENT α=0.10 (off-grid) | 11.8% | 0.10 | 0.510870 | 8.72pp | Brier yes, ECE fails (+0.91pp) |
| ARM A — PIT (real) | 10.8% | 0.05 | 0.512808 | 8.24pp | no (Brier higher) |
| ARM B — synthetic (contaminated) | 100.0% | 0.02 | 0.512630 | 6.65pp | no (Brier higher) |

Every α in both arms has median Brier ≥ the incumbent's 0.511581. Neither arm clears the ship gate.
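
The ship gate reduces to a two-part predicate. The sketch below hard-codes the incumbent figures quoted in this note and applies the gate to the three non-incumbent rows of the results table; the function and constant names are illustrative, not the backtest script's own.

```python
INCUMBENT_BRIER = 0.511581   # CURRENT, alpha = 0.05
INCUMBENT_ECE_PP = 7.81      # median ECE in percentage points
ECE_TOLERANCE_PP = 0.2

def clears_ship_gate(median_brier: float, median_ece_pp: float) -> bool:
    """True only if Brier is strictly lower AND ECE is within +0.2pp of the incumbent."""
    brier_ok = median_brier < INCUMBENT_BRIER
    # Small epsilon guards against float noise in 7.81 + 0.2.
    ece_ok = median_ece_pp <= INCUMBENT_ECE_PP + ECE_TOLERANCE_PP + 1e-9
    return brier_ok and ece_ok

# (median Brier, median ECE in pp) taken from the results table.
candidates = {
    "CURRENT alpha=0.10": (0.510870, 8.72),
    "ARM A PIT": (0.512808, 8.24),
    "ARM B synthetic": (0.512630, 6.65),
}

for name, (brier, ece) in candidates.items():
    print(name, clears_ship_gate(brier, ece))
```

CURRENT α=0.10 fails on the ECE half only, Arm B on the Brier half only, and Arm A on both.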

Per-α detail:

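| α | CURRENT Brier | PIT Brier (real) | SYNTH Brier (contam.) |
| --- | --- | --- | --- |
| 0.000 | 0.514023 | 0.514023 | 0.514023 |
| 0.005 | 0.513699 | 0.513825 | 0.513556 |
| 0.010 | 0.513392 | 0.513645 | 0.513167 |
| 0.020 | 0.512832 | 0.513338 | 0.512630 |
| 0.050 | 0.511581 | 0.512808 | 0.512968 |
| 0.100 | 0.510870 | 0.512932 | 0.520006 |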

Improvement of the best α over each policy's own α=0 baseline:

| Policy | coverage | best ΔBrier vs own α=0 |
| --- | --- | --- |
| CURRENT | 11.8% | +31.5 bp (α=0.10) |
| ARM A — PIT | 10.8% | +12.2 bp (α=0.05) |
| ARM B — synthetic | 100.0% | +13.9 bp (α=0.02) |
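
These gains can be re-derived from the per-α detail. A short check, using the values exactly as printed in this note, with 1 bp taken as 0.0001 of Brier and a positive sign meaning improvement (the variable and function names are illustrative):

```python
ALPHAS = [0.000, 0.005, 0.010, 0.020, 0.050, 0.100]
BRIER = {
    "CURRENT": [0.514023, 0.513699, 0.513392, 0.512832, 0.511581, 0.510870],
    "PIT":     [0.514023, 0.513825, 0.513645, 0.513338, 0.512808, 0.512932],
    "SYNTH":   [0.514023, 0.513556, 0.513167, 0.512630, 0.512968, 0.520006],
}

def best_gain_bp(scores):
    """Return (best alpha, improvement over the alpha=0 baseline in basis points)."""
    baseline = scores[0]  # ALPHAS[0] == 0.0
    best_idx = min(range(len(scores)), key=scores.__getitem__)
    return ALPHAS[best_idx], (baseline - scores[best_idx]) * 1e4

for policy, scores in BRIER.items():
    alpha, gain = best_gain_bp(scores)
    print(f"{policy}: best alpha={alpha}, gain={gain:.2f} bp")
```

Up to rounding, this reproduces the +31.5, +12.2 and +13.9 bp figures and the best-α column.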

Why coverage is not the lever

The decisive contrast is the last table: the 11.8%-coverage CURRENT composite produces the largest per-baseline improvement (+31.5 bp), while the 100%-coverage synthetic produces only +13.9 bp. More coverage with a weaker composite does not beat narrower coverage with a strong one. Two mechanisms:

  1. The synthetic proxy is redundant with DC. It's built from the DC attack−defence strength, which the model already encodes. Adding a re-scaled copy of the fit back as a log-rate offset moves predictions but carries almost no independent signal — so even firing on 100% of matches it barely improves Brier, and at α=0.10 it overshoots into a −60 bp regression. A real broadened composite (TM value + per-90 stats for non-WC players) would carry more independent signal — but we have no such data, which is finding 1. The contaminated proxy is the optimistic ceiling for "coverage alone", and it loses.

  2. The PIT composite is honest but noisier than the fixed roster. At the same ~11% coverage, PIT is worse than CURRENT (0.512808 vs 0.511581 at α=0.05). Within the 2024–2026 holdout the true roster ≈ the 2026 roster, so PIT's temporal correction buys little — and the thinner rated-active pool (~13–14 players, frequent up-normalisation) and apps-weighting add noise. The differential correlation between PIT and CURRENT is ~0.68, so PIT is a genuine alternative, just a worse one here. Temporal accuracy is not the lever either.
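
The differential correlation quoted in point 2 compares the two composites on the quantity the offset actually uses, comp_home − comp_away, rather than on team-level values. A minimal sketch, assuming Pearson correlation over covered fixtures and, for simplicity, that the PIT composite has already been looked up for each fixture's match year; the data are invented.

```python
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def differential_correlation(fixtures, comp_a, comp_b):
    """Correlate home-minus-away composite differentials under two composites."""
    diffs_a = [comp_a[h] - comp_a[a] for h, a in fixtures]
    diffs_b = [comp_b[h] - comp_b[a] for h, a in fixtures]
    return pearson(diffs_a, diffs_b)

# Toy composites (made up) standing in for CURRENT and PIT.
current = {"A": 9.0, "B": 7.0, "C": 5.0}
pit = {"A": 8.0, "B": 7.5, "C": 4.0}
fixtures = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "A")]

print(differential_correlation(fixtures, current, pit))
```

A value near 1 would mean the two composites imply nearly the same offsets; the ~0.68 reported above indicates they disagree often enough to be distinct policies.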

The incumbent's grid remains right-censored (α=0.10 still lowers Brier, to 0.510870) — consistent with the fitted-α note — but α=0.10 fails the ECE half of the gate (8.72pp vs the 8.01pp ceiling), so it is not a ship either.

Decision

No ship. Keep the hand-set α = 0.05 current-composite offset (Model 16) unchanged. The coverage hypothesis from the fitted-α note is not supported by the evidence available:

  • A real point-in-time composite at the same coverage is worse than the shipped fixed-roster composite (temporal accuracy doesn't help inside the 2024–2026 holdout).
  • An optimistic, contaminated full-coverage proxy still fails to beat the incumbent, and shows that coverage scaled with a weak/redundant composite shrinks the per-baseline gain rather than growing it.

The honest lever, if any, is composite quality on the covered fixtures plus genuinely independent signal for new nations — i.e. acquiring real per-player ratings + point-in-time rosters for ~100 non-WC nations (a large pull), not simply turning the offset on for more matches. Until such data exists, broadening coverage is not worth funding on the strength of these numbers.

Caveats

  • Negative result, cleanly derived. The CURRENT baselines reproduce the incumbent bit-for-bit; the PIT arm is honest within the holdout window (finding 2); the synthetic arm is explicitly labelled contaminated and ship-ineligible everywhere it appears.
  • The synthetic arm is an upper bound on "coverage alone", not a realistic broadened composite. Its proxy is DC-derived (redundant), so a real broadened composite could do better per match — but we cannot test that without the missing non-WC data, and the redundancy is precisely why "just turn the offset on everywhere" is not a free win.
  • No production artefact was mutated. team_composite_sum.csv, dixon_coles.json, hierarchical_poisson.json retain their shipped state; this experiment writes only gitignored sidecars.
  • Metrics are on the DC+HP mean, not the full Elo+DC+HP ensemble — same target as the incumbent so the offset effect isn't diluted by Elo. Folded into the full ensemble the effect is smaller still.