Status: Not shipped as default. Backtest complete (2026-05-27): predicted-XI does not lift Brier or log-loss over the legacy chain. Flag remains available for opt-in but the default stays timeline. See verdict below.
Backtest date: 2026-05-27
Reproducer: scripts/backtest_anytime_scorer.py
Code: scripts/build_anytime_scorer.py (new load_predicted_xi_start_probs, layered start_prob chain, --start-prob-source flag)
Tests: tests/test_build_anytime_scorer.py::TestLoadPredictedXiStartProbs, tests/test_build_anytime_scorer.py::TestBuildPredictedXiLayer, tests/test_backtest_anytime_scorer.py
Hypothesis
Model #5 (scripts/build_anytime_scorer.py) produces P(player scores ≥ 1 across the WC tournament). The headline depends on E[minutes], which is derived from start_prob (the per-match starter likelihood). The v1 chain was:
- Timeline: recency-weighted appearances in data/wc2026/intl_career_timeline.csv (apps in years ≥ 2024 ÷ team's recent match count).
- Caps proxy (fallback): career caps ÷ team's match count in the last 2 years.
The repo now carries Model #4's web/public/predicted_squads.json (a 26-man squad per WC nation split into starting_xi and bench). It is the model's best forward-looking guess at who will be in the WC squad and who will start. The hypothesis is that using that explicit XI signal as a higher-precision start_prob layer should beat the recency-weighted timeline aggregate, especially for borderline cases (a veteran with low recent apps who Model #4 expects to start anyway; a young player with no caps but in Model #4's projected XI).
Layered chain (proposed v2)
start_prob = predicted_xi(team, player) # 1248 of ~7585 squad rows (~16%)
?? timeline(player, team) # ~60% of the remainder
?? caps_proxy(player, team) # last-resort fallback
The predicted-XI bands are deliberately uncertain — we have no fixture-level lineup yet, just season-level squad projection:
| squad_status | starter | bench |
|---|---|---|
| predicted (Model #4 has a confident XI) | 0.85 | 0.10 |
| wikipedia_pool (no Model #4 XI; fallback split) | 0.75 | 0.15 |
A starter at 0.85 → E[minutes] ≈ 67. A bench player at 0.10 → E[minutes] ≈ 26. Once fixture-level lineups land (FIFA submits them ~75 min before kickoff), a separate code path would override these defaults with 1.0/0.0 for the actual XI vs bench.
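A minimal Python sketch of the layered lookup and band table above. The function name layered_start_prob, its arguments, and the XI_BANDS dictionary are hypothetical and are not taken from scripts/build_anytime_scorer.py; the source labels mirror the start_prob_source values this note describes for the output CSV.

```python
from typing import Optional, Tuple

# Band values copied from the table above, keyed by (squad_status, role).
XI_BANDS = {
    ("predicted", "starter"): 0.85,
    ("predicted", "bench"): 0.10,
    ("wikipedia_pool", "starter"): 0.75,
    ("wikipedia_pool", "bench"): 0.15,
}


def layered_start_prob(
    xi_entry: Optional[Tuple[str, str]],
    timeline_prob: Optional[float],
    caps_prob: float,
) -> Tuple[float, str]:
    """Return (start_prob, source) from the first layer that has a value.

    xi_entry is (squad_status, role) when the player appears in the
    predicted squad, else None. timeline_prob is None when the player has
    no usable timeline rows. caps_prob is the last-resort fallback.
    """
    if xi_entry is not None:
        squad_status, role = xi_entry
        source = "predicted_xi_start" if role == "starter" else "predicted_xi_bench"
        return XI_BANDS[(squad_status, role)], source
    if timeline_prob is not None:
        return timeline_prob, "timeline"
    return caps_prob, "caps"
```

Returning the source label alongside the probability keeps the attribution next to the value, so a caller never has to re-derive which layer fired.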
Data sources used
| Source | File | Coverage |
|---|---|---|
| Predicted XI / bench | web/public/predicted_squads.json | 48 teams; 528 starters + 720 bench (= 1248 player-rows) |
| Timeline (v1) | data/wc2026/intl_career_timeline.csv | 1,563 unique players; 8,449 player-year rows |
| Caps proxy (v1) | data/wc2026/squads.csv | 7,585 player-team rows (multiple snapshots) |
| Squad-announce status | web/public/squad_announcements.json | 14 teams flagged final or provisional as of 2026-05-25 |
| Wyscout starting lineups | data/wc2026/wyscout_lineups.csv | 115 historical matches, ~2,530 starter rows. No goalscorer events on disk — cannot drive a true outcome backtest. |
The squad_announcements.json table records when a team published its final squad, not who is in it. It's not consumed by this PR; flagged here for follow-up.
Backtest
Setup
scripts/backtest_anytime_scorer.py holds every input identical across the three start_prob strategies — same npxG, same e_matches=3.0, same opp factor (1.0), no injuries — and varies ONLY the start_prob layer:
| Source | Inputs |
|---|---|
| caps | squads.csv.intl_caps ÷ team's recent match count (legacy v0). |
| timeline | intl_career_timeline.csv apps in 2022-2024, truncated at the cutoff (legacy v1). |
| predicted_xi | predicted_squads.json band, then timeline, then caps (proposed v2). |
Outcome: y = 1 if the player has ≥ 1 international goal recorded in 2025 (per intl_career_timeline.csv), else y = 0. Only players with a 2025 row are scored (don't punish the model for correctly assigning low probability to absent players).
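The labelling rule above can be sketched as follows. This is an illustration only: the row keys player and year are assumed column names (goals is the per-year field mentioned in this note), and label_outcomes is a hypothetical helper, not the function used in scripts/backtest_anytime_scorer.py.

```python
def label_outcomes(timeline_rows, outcome_year=2025):
    """Map player -> y for players with a row in outcome_year.

    y = 1 if the player has at least one goal recorded in that year,
    else 0. Players without a row for outcome_year are left out, so they
    are never scored.
    """
    labels = {}
    for row in timeline_rows:
        if int(row["year"]) != outcome_year:
            continue
        scored = 1 if int(row["goals"]) >= 1 else 0
        # If a player somehow has several rows for the year, any goal counts.
        labels[row["player"]] = max(labels.get(row["player"], 0), scored)
    return labels
```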
Results
| Source | Brier | LogLoss | n |
|----------------|---------|----------|-------|
| caps | 0.3049 | 1.1677 | 1278 |
| timeline | 0.3151 | 1.2712 | 1278 |
| predicted_xi | 0.3161 | 1.3010 | 1278 |
Lift (negative = the new layer improves Brier):
predicted_xi_vs_caps_brier_delta +0.0112
predicted_xi_vs_timeline_brier_delta +0.0010
timeline_vs_caps_brier_delta +0.0102
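For reference, a minimal sketch of the two metrics and of how the deltas follow from the reported Brier scores. The helper names and the clipping constant eps are assumptions; the backtest script may compute these differently (for example via a library).

```python
import math


def brier(y_true, p_pred):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)


def log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood; predictions are clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


# Brier scores as reported in the results table.
reported_brier = {"caps": 0.3049, "timeline": 0.3151, "predicted_xi": 0.3161}

# Delta = new arm minus reference arm, so a positive value is a regression.
deltas = {
    "predicted_xi_vs_caps": reported_brier["predicted_xi"] - reported_brier["caps"],
    "predicted_xi_vs_timeline": reported_brier["predicted_xi"] - reported_brier["timeline"],
    "timeline_vs_caps": reported_brier["timeline"] - reported_brier["caps"],
}
```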
Persisted JSON: /tmp/anytime_backtest.json (ephemeral; re-run with --json).
Reading the numbers
The model badly under-predicts in all three arms: the base rate is 0.35 but every arm's mean prediction is ~0.03. That's expected: the headline p_score_tournament is calibrated for a 7-match WC bracket against tough international defences, applied here to a "scored in any 2025 international fixture (friendly, qualifier, continental cup)" outcome. The absolute Brier numbers are not what the headline targets; only the relative ordering of the three arms is informative.
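A quick sanity check on the magnitudes, as a sketch. It treats every prediction as the single constant 0.03, which is a simplification (real predictions vary per player), and asks what Brier and log-loss that would give against a 0.35 base rate.

```python
import math

base_rate = 0.35   # share of scored players, from the text above
mean_pred = 0.03   # approximate mean prediction, from the text above

# Positives contribute (1 - p)^2, negatives contribute p^2.
brier_const = base_rate * (1 - mean_pred) ** 2 + (1 - base_rate) * mean_pred ** 2

# Positives contribute -ln(p), negatives contribute -ln(1 - p).
logloss_const = -(
    base_rate * math.log(mean_pred) + (1 - base_rate) * math.log(1 - mean_pred)
)

print(f"constant-prediction Brier    ~ {brier_const:.3f}")    # ~ 0.330
print(f"constant-prediction log-loss ~ {logloss_const:.3f}")  # ~ 1.247
```

Both values sit close to the reported 0.30-0.32 Brier and 1.17-1.30 log-loss, which suggests the absolute scores are driven mostly by the gap between mean prediction and base rate rather than by the start_prob layer.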
On that relative ordering, caps slightly beats timeline slightly beats predicted_xi. Why?
- caps is a continuous signal spanning hundreds of distinct values. The proxy correlates strongly with the player's career history of being a regular goalscorer, which is itself the best predictor of "did they score in a 2025 fixture".
- predicted_xi is a 4-bucket discrete signal (0.85 / 0.75 / 0.10 / 0.15). It cannot rank-order players within a starter group: Sørloth, Kane, and Embolo all get the same 0.85, so the npxG/90 term ends up doing all the work. The intra-bucket ranking that caps provides is lost.
- The outcome itself is goal-rate-dominated, not minutes-dominated. In 2025 most players got SOME minutes (apps > 0 base rate = 0.98), so a start_prob layer that distinguishes "starter vs bench" has limited gain. The model's value-add would show up against an outcome like "scored on game 1 of a 7-match WC schedule".
Contamination caveat (read this)
All three arms leak future information to varying degrees — the backtest is a relative comparison of three contaminated signals, not a clean held-out evaluation. Concretely:
- predicted_xi is the most contaminated. predicted_squads.json was built from the May 2026 squad pool, after the 2025 outcomes occurred. Using today's predicted XI as a 2024-cutoff forecast leaks 6-12 months of "we know who ended up featuring" into the arm. The arm is therefore advantaged relative to a real forward-looking deployment.
- caps is also contaminated. squads.csv.intl_caps is the most-recent snapshot per player (the loader explicitly takes keep="last" on snapshot_date), so intl_caps reflects caps accumulated through May 2026, including all of 2025. The caps arm sees the same 2025 history as predicted_xi, just through a different aggregate. Building a truly 2024-cutoff caps figure would require historical Wikipedia squad snapshots we don't currently retain.
- timeline is the LEAST contaminated. _truncate_timeline_to_cutoff(timeline, 2024) strictly removes 2025+ rows before computing the recency-weighted aggregate. The timeline arm is the closest thing we have to an honest 2024-cutoff signal.
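The difference between the two loaders can be sketched as below. These are hypothetical stand-ins, not the repo's _truncate_timeline_to_cutoff or squads loader; the row keys player and year are assumed, while snapshot_date and intl_caps are the fields named above.

```python
def truncate_to_cutoff(timeline_rows, cutoff_year):
    """Drop every row after cutoff_year, so no later season can leak in."""
    return [row for row in timeline_rows if int(row["year"]) <= cutoff_year]


def latest_snapshot_caps(squad_rows):
    """Keep intl_caps from the most recent snapshot per player.

    Sorting by snapshot_date and letting later rows overwrite earlier ones
    has the same effect as a keep="last" de-duplication. ISO-formatted
    dates (YYYY-MM-DD) sort correctly as strings.
    """
    latest = {}
    for row in sorted(squad_rows, key=lambda r: r["snapshot_date"]):
        latest[row["player"]] = row["intl_caps"]
    return latest
```

The first function enforces the cutoff; the second has no cutoff at all, which is why a caps figure built this way carries everything up to the latest snapshot.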
That predicted_xi STILL loses to caps on this contaminated backtest is therefore a weaker signal than it looks — both arms see 2025 information; the caps arm just happens to embed it as a continuous value while predicted_xi compresses it to a four-bucket band. The honest read is that the backtest cannot distinguish the two arms reliably; it can only rule out a large lift from predicted_xi over the legacy chain.
A clean backtest would need (a) a historical predicted-XI snapshot (Model #4 output as of, say, 2024-12-31) and (b) a historical squads snapshot for the caps denominator. Both are recoverable from git history of web/public/predicted_squads.json and data/wc2026/squads.csv respectively — left as future work.
What we cannot backtest
A true "did this player score in a WC fixture" backtest needs per-match goal-event attribution. We checked:
- data/raw/wyscout/: directory empty; events are documented but not on disk.
- data/raw/intl/results.csv: match-level scores, no scorer attribution.
- data/wc2026/wyscout_lineups.csv: 115 historical tournament starting XIs (gold standard for the start-prob signal), no goal events.
- data/wc2026/intl_career_timeline.csv: per-player per-year goals, not per-match.
To run a true backtest the next pass would need to either (a) re-pull the cached data/raw/wyscout/events/<match_id>.json events (per scripts/pull_wyscout_open.py), or (b) ingest the FotMob match-detail scorer data.
Decision
Ship the layered start_prob behind a flag, default OFF. Concretely:
- scripts/build_anytime_scorer.py --start-prob-source timeline: default; preserves v1 behaviour exactly.
- scripts/build_anytime_scorer.py --start-prob-source predicted_xi: opts into the new layered chain.
- scripts/build_anytime_scorer.py --start-prob-source caps: full ablation; reproduces the pre-D5 v0 behaviour for testing.
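One way such a flag could be declared with argparse, as a sketch. The flag name, its three values, and the timeline default come from this note; build_parser, the description, and the help text are illustrative and may not match the actual script.

```python
import argparse


def build_parser():
    """Parser exposing only the start_prob source switch."""
    parser = argparse.ArgumentParser(description="Build the anytime-scorer table.")
    parser.add_argument(
        "--start-prob-source",
        choices=["timeline", "predicted_xi", "caps"],
        default="timeline",  # keeps v1 behaviour unless the caller opts in
        help="Which start_prob layer chain to use.",
    )
    return parser


args = build_parser().parse_args([])
print(args.start_prob_source)  # timeline
```

Restricting the value with choices makes a typo fail at parse time instead of silently falling through to a default chain.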
The output CSV gets a new start_prob_source column attributing each player's layer ("predicted_xi_start", "predicted_xi_bench", "timeline", or "caps"), so a reader of /scorers/ can see which signal drives each headline regardless of the active flag.
The default stays timeline because:
- The contaminated backtest does NOT show predicted_xi lift, so flipping the default would ship a regression on the only metric we can measure today.
- The predicted-XI band shape (0.85 / 0.10) is more interpretable for a reader (a player's headline reflects a model-of-Model-#4 forecast rather than a coarse career-caps aggregate), but interpretability gains don't justify a measurable Brier regression.
- Flipping the flag in a follow-up is one line; rolling back a default change after the CSV ships and feeds downstream pages (/scorers/, predicted_squads.json consumers) is several lines and risks stale-cache fanout.
Verdict (2026-05-28)
Closing this investigation. The predicted-XI layer does not ship as the default.
The contaminated backtest (the only one available without substantial historical-snapshot recovery work) shows predicted_xi as the worst of the three arms. The leak of post-2025 squad information works in favour of the predicted-XI arm, so a clean historical backtest is unlikely to score predicted_xi any better than this one does. The measured ordering is caps > timeline > predicted_xi, but only part of it is robust: predicted_xi fails to beat timeline, the least contaminated arm, even with that advantage, whereas the caps arm is itself contaminated and its lead over the other two cannot be confirmed without historical snapshots.
The fundamental limitation is structural, not data-quality: the predicted-XI signal compresses a continuous start-probability into 4 discrete bands (0.85/0.75/0.10/0.15), discarding the intra-bucket ranking that caps provides via hundreds of distinct values. The outcome being measured (scored in any 2025 international fixture) is goal-rate-dominated rather than minutes-dominated, which further reduces the minutes-precision advantage the predicted-XI layer was designed to provide.
The --start-prob-source predicted_xi flag remains available for opt-in experimentation but is not the production default.
Follow-ups (deprioritised)
These are preserved for completeness but are no longer blocking any decision:
- Confirmed-XI override: the only path to a real outcome lift. Once a team's actual match-day XI is published (~75 min before kickoff), a code path should set start_prob = 1.0 for the 11 / 0.05 for benched players. This is worth doing during the tournament itself regardless of the predicted-XI result.
- Continuous predicted_xi probability: Model #4's rank_in_position field could soften the 4-bucket bands to a continuous score. Moot unless follow-up #1 shows minutes-precision matters more than the current backtest suggests.
- Per-match goal events: re-hydrate data/raw/wyscout/events/ to score P(player scores in match M) directly. Would test the right outcome but requires data work and is blocked on Wyscout event access.
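A sketch of what the confirmed-XI override might look like. Nothing here exists in the repo yet: apply_confirmed_xi and its arguments are hypothetical, and bench_prob is left as a parameter because the right value for benched players is a design choice.

```python
def apply_confirmed_xi(start_probs, confirmed_xi, matchday_squad, bench_prob=0.05):
    """Override start_prob once the match-day lineup is known.

    start_probs: dict of player -> start_prob from the pre-match chain.
    confirmed_xi: set of the 11 confirmed starters.
    matchday_squad: every player named for the match (starters plus bench).
    Players outside matchday_squad keep their original value.
    """
    updated = dict(start_probs)  # copy so the pre-match values are not mutated
    for player in matchday_squad:
        updated[player] = 1.0 if player in confirmed_xi else bench_prob
    return updated
```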
Compliance check
The change is internal to Model #5: no new UI surface, no exposed market comparison, no betting language in any output. The new CSV column start_prob_source is research-vocabulary. Per CLAUDE.md §12 the change ships clean: no operator names, no edge calculations, no betting vocabulary, no track-record framing.