Research note

Anytime-scorer `start_prob` v2 — predicted-XI layer (default-off)

Status: Not shipped as default. Backtest complete (2026-05-27): predicted-XI does not lift Brier or log-loss over the legacy chain. Flag remains available for opt-in but the default stays `timeline`. See verdict below.
Backtest date: 2026-05-27
Reproducer: scripts/backtest_anytime_scorer.py
Code: scripts/build_anytime_scorer.py (new load_predicted_xi_start_probs, layered start_prob chain, --start-prob-source flag)
Tests: tests/test_build_anytime_scorer.py::TestLoadPredictedXiStartProbs, tests/test_build_anytime_scorer.py::TestBuildPredictedXiLayer, tests/test_backtest_anytime_scorer.py

Hypothesis

Model #5 (scripts/build_anytime_scorer.py) produces P(player scores ≥ 1 across the WC tournament). The headline depends on E[minutes], which is derived from start_prob (the per-match starter likelihood). The v1 chain was:

  1. Timeline — recency-weighted appearances in data/wc2026/intl_career_timeline.csv (apps in years ≥ 2024 ÷ team's recent match count).
  2. Caps proxy (fallback) — career caps ÷ team's match count in the last 2 years.

The repo now carries Model #4's web/public/predicted_squads.json (a 26-man squad per WC nation split into starting_xi and bench). It is the model's best forward-looking guess at who will be in the WC squad and who will start. The hypothesis is that using that explicit XI signal as a higher-precision start_prob layer should beat the recency-weighted timeline aggregate, especially for borderline cases (a veteran with low recent apps who Model #4 expects to start anyway; a young player with no caps but in Model #4's projected XI).

Layered chain (proposed v2)

start_prob = predicted_xi(team, player)   # 1248 of ~7585 squad rows (~16%)
          ?? timeline(player, team)        # ~60% of the remainder
          ?? caps_proxy(player, team)      # last-resort fallback
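
The fallback chain above can be sketched in Python as a first-match lookup. This is a minimal illustration, not the code in scripts/build_anytime_scorer.py: the function name `layered_start_prob`, the dict-backed layers, the team and player keys, and the timeline and caps values are all hypothetical. Only the layer order and the 0.85 / 0.10 bands come from this note.

```python
from typing import Callable, Optional, Sequence, Tuple

# A layer returns a probability, or None when it has no row for the player.
Layer = Callable[[str, str], Optional[float]]


def layered_start_prob(
    team: str, player: str, layers: Sequence[Tuple[str, Layer]]
) -> Tuple[float, str]:
    """Return (start_prob, layer_name) from the first layer that has a value."""
    for name, layer in layers:
        p = layer(team, player)
        if p is not None:
            return p, name
    raise LookupError(f"no start_prob layer covers {player} ({team})")


# Illustrative lookups only; keys and the timeline/caps values are made up.
predicted_xi = {("TeamA", "Starter"): 0.85, ("TeamA", "Sub"): 0.10}
timeline = {("TeamA", "Veteran"): 0.40}
caps_proxy = {("TeamA", "Veteran"): 0.55, ("TeamA", "Newcomer"): 0.05}

layers = [
    ("predicted_xi", lambda t, p: predicted_xi.get((t, p))),
    ("timeline", lambda t, p: timeline.get((t, p))),
    ("caps", lambda t, p: caps_proxy.get((t, p))),
]

print(layered_start_prob("TeamA", "Starter", layers))
print(layered_start_prob("TeamA", "Veteran", layers))
print(layered_start_prob("TeamA", "Newcomer", layers))
```

The sketch tests `is not None` rather than truthiness so that a legitimate probability of 0.0 from an earlier layer is not skipped in favour of a later one.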

The predicted-XI bands are deliberately uncertain — we have no fixture-level lineup yet, just season-level squad projection:

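| squad_status                                    | starter | bench |
|-------------------------------------------------|---------|-------|
| predicted (Model #4 has a confident XI)         |    0.85 |  0.10 |
| wikipedia_pool (no Model #4 XI; fallback split) |    0.75 |  0.15 |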

A starter at 0.85 → E[minutes] ≈ 67. A bench player at 0.10 → E[minutes] ≈ 26. Once fixture-level lineups land (FIFA submits them ~75 min before kickoff), a separate code path would override these defaults with 1.0/0.0 for the actual XI vs bench.
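
The two E[minutes] figures above are consistent with a simple linear mix, E[minutes] = start_prob × (minutes when starting) + (1 − start_prob) × (minutes when not starting). That functional form is an assumption made here for illustration; the note does not state the formula Model #5 uses. Under that assumption, the two quoted points pin down both minute values, as the following sketch shows (`implied_minutes` is a hypothetical helper).

```python
def implied_minutes(p1, m1, p2, m2):
    """Solve p*a + (1-p)*b = m at two points.

    a = expected minutes when starting, b = expected minutes when not starting.
    """
    # Two linear equations in (a, b), solved with Cramer's rule.
    det = p1 * (1 - p2) - p2 * (1 - p1)
    a = (m1 * (1 - p2) - m2 * (1 - p1)) / det
    b = (p1 * m2 - p2 * m1) / det
    return a, b


# The two (start_prob, E[minutes]) points quoted in the note.
a, b = implied_minutes(0.85, 67, 0.10, 26)
print(round(a, 1), round(b, 1))  # about 75.2 and 20.5
```

Because the quoted figures are themselves rounded, the implied values are approximate.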

Data sources used

| Source                   | File                                   | Coverage |
|--------------------------|----------------------------------------|----------|
| Predicted XI / bench     | web/public/predicted_squads.json       | 48 teams; 528 starters + 720 bench (= 1248 player-rows) |
| Timeline (v1)            | data/wc2026/intl_career_timeline.csv   | 1,563 unique players; 8,449 player-year rows |
| Caps proxy (v1)          | data/wc2026/squads.csv                 | 7,585 player-team rows (multiple snapshots) |
| Squad-announce status    | web/public/squad_announcements.json    | 14 teams flagged final or provisional as of 2026-05-25 |
| Wyscout starting lineups | data/wc2026/wyscout_lineups.csv        | 115 historical matches, ~2,530 starter rows. No goalscorer events on disk; cannot drive a true outcome backtest. |

The squad_announcements.json table records when a team published its final squad, not who is in it. It's not consumed by this PR; flagged here for follow-up.

Backtest

Setup

scripts/backtest_anytime_scorer.py holds every input identical across the three start_prob strategies — same npxG, same e_matches=3.0, same opp factor (1.0), no injuries — and varies ONLY the start_prob layer:

| Source       | Inputs |
|--------------|--------|
| caps         | squads.csv.intl_caps ÷ team's recent match count (legacy v0). |
| timeline     | intl_career_timeline.csv apps in 2022-2024, truncated at the cutoff (legacy v1). |
| predicted_xi | predicted_squads.json band, then timeline, then caps (proposed v2). |

Outcome: y = 1 if the player has ≥ 1 international goal recorded in 2025 (per intl_career_timeline.csv), else y = 0. Only players with a 2025 row are scored (don't punish the model for correctly assigning low probability to absent players).
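
A minimal sketch of that labelling rule, assuming the timeline is available as rows with `player`, `year` and `goals` fields. The field names and the helper `label_outcomes` are hypothetical and are not taken from scripts/backtest_anytime_scorer.py.

```python
def label_outcomes(timeline_rows, outcome_year=2025):
    """Map player -> y for players that have a row in the outcome year.

    Players with no row in that year are left out entirely, so they are
    not scored by the backtest.
    """
    labels = {}
    for row in timeline_rows:
        if row["year"] == outcome_year:
            labels[row["player"]] = 1 if row["goals"] >= 1 else 0
    return labels


rows = [
    {"player": "A", "year": 2024, "goals": 3},
    {"player": "A", "year": 2025, "goals": 2},  # scored in 2025 -> y = 1
    {"player": "B", "year": 2025, "goals": 0},  # played, no goal -> y = 0
    {"player": "C", "year": 2024, "goals": 1},  # no 2025 row -> not scored
]
labels = label_outcomes(rows)
print(labels)
```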

Results

| Source         |   Brier |  LogLoss |     n |
|----------------|---------|----------|-------|
| caps           |  0.3049 |   1.1677 |  1278 |
| timeline       |  0.3151 |   1.2712 |  1278 |
| predicted_xi   |  0.3161 |   1.3010 |  1278 |

Lift (negative = the new layer improves Brier):
  predicted_xi_vs_caps_brier_delta                  +0.0112
  predicted_xi_vs_timeline_brier_delta              +0.0010
  timeline_vs_caps_brier_delta                      +0.0102

Persisted JSON: /tmp/anytime_backtest.json (ephemeral; re-run with --json).

Reading the numbers

The model heavily under-predicts across all three arms: the base rate is 0.35 but every arm's mean prediction is ~0.03. That is expected, because the headline p_score_tournament is calibrated for a 7-match WC bracket against tough international defences, whereas here it is applied to a "scored in any 2025 international fixture (friendly, qualifier, continental cup)" outcome. The absolute Brier numbers are not what the headline targets; only the relative ordering of the three arms is informative.
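
As a rough sanity check on those magnitudes, the sketch below scores a constant prediction of 0.03 against a synthetic sample with a 0.35 base rate. The constant prediction is a deliberate simplification (the real arms vary per player), and the helper names are illustrative. The result lands in the same neighbourhood as the table (Brier 0.30 to 0.32, log-loss 1.17 to 1.30), which suggests the absolute scores are driven mostly by the gap between mean prediction and base rate.

```python
import math


def brier(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(outcomes)


def log_loss(probs, outcomes, eps=1e-15):
    """Mean negative log-likelihood, with probabilities clipped away from 0 and 1."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(outcomes)


# Synthetic sample: 35 scorers out of 100, every prediction fixed at 0.03.
outcomes = [1] * 35 + [0] * 65
probs = [0.03] * 100

b = brier(probs, outcomes)
ll = log_loss(probs, outcomes)
print(round(b, 4), round(ll, 4))  # about 0.330 and 1.247
```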

On that relative ordering, caps slightly beats timeline, which in turn slightly beats predicted_xi. Why?

  • caps is a continuous signal spanning hundreds of distinct values. The proxy correlates strongly with the player's career history of being a regular goalscorer, which is itself the best predictor of "did they score in a 2025 fixture".
  • predicted_xi is a 4-bucket discrete signal (0.85 / 0.75 / 0.10 / 0.15). It cannot rank-order players within a starter group — Sørloth, Kane, and Embolo all get the same 0.85, so the npxG/90 term ends up doing all the work. The intra-bucket ranking that caps provides is lost.
  • The outcome itself is goal-rate-dominated, not minutes-dominated. In 2025 most players got SOME minutes (apps > 0 base rate = 0.98), so a start_prob layer that distinguishes "starter vs bench" has limited gain. The model's value-add would show up against an outcome like "scored on game 1 of a 7-match WC schedule".

Contamination caveat (read this)

All three arms leak future information to varying degrees — the backtest is a relative comparison of three contaminated signals, not a clean held-out evaluation. Concretely:

  • predicted_xi is the most contaminated. predicted_squads.json was built from the May 2026 squad pool, after the 2025 outcomes occurred. Using today's predicted XI as a 2024-cutoff forecast leaks 6-12 months of "we know who ended up featuring" into the arm. The arm is therefore advantaged relative to a real forward-looking deployment.
  • caps is also contaminated. squads.csv.intl_caps is the most-recent snapshot per player (the loader explicitly takes keep="last" on snapshot_date), so intl_caps reflects caps accumulated through May 2026 — including all of 2025. The caps arm sees the same 2025 history as predicted_xi, just through a different aggregate. Building a truly 2024-cutoff caps figure would require historical Wikipedia squad snapshots we don't currently retain.
  • timeline is the LEAST contaminated. _truncate_timeline_to_cutoff(timeline, 2024) strictly removes 2025+ rows before computing the recency-weighted aggregate. The timeline arm is the closest thing we have to an honest 2024-cutoff signal.
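
The difference between the timeline and caps arms comes down to two data-preparation steps, sketched below on toy rows. The helpers `truncate_to_cutoff` and `latest_snapshot` are hypothetical stand-ins written for this note, not the repo's _truncate_timeline_to_cutoff or its squads loader, and the field names and numbers are invented.

```python
def truncate_to_cutoff(rows, cutoff_year):
    """Keep only rows dated at or before the cutoff year."""
    return [r for r in rows if r["year"] <= cutoff_year]


def latest_snapshot(rows):
    """Keep the most recent snapshot per player (mirrors a keep-last dedupe)."""
    latest = {}
    # ISO dates sort correctly as strings, so later snapshots overwrite earlier ones.
    for r in sorted(rows, key=lambda r: r["snapshot_date"]):
        latest[r["player"]] = r
    return latest


timeline_rows = [
    {"player": "A", "year": 2024, "apps": 8},
    {"player": "A", "year": 2025, "apps": 10},
]
squad_rows = [
    {"player": "A", "snapshot_date": "2024-12-01", "intl_caps": 40},
    {"player": "A", "snapshot_date": "2026-05-01", "intl_caps": 52},
]

clean = truncate_to_cutoff(timeline_rows, 2024)  # 2025 row is dropped
leaky = latest_snapshot(squad_rows)              # 2026 snapshot wins

print([r["year"] for r in clean], leaky["A"]["intl_caps"])
```

In the toy data the truncated timeline sees nothing after 2024, while the keep-last caps figure already includes appearances made during the outcome year.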

That predicted_xi STILL loses to caps on this contaminated backtest is therefore a weaker signal than it looks — both arms see 2025 information; the caps arm just happens to embed it as a continuous value while predicted_xi compresses it to a four-bucket band. The honest read is that the backtest cannot distinguish the two arms reliably; it can only rule out a large lift from predicted_xi over the legacy chain.

A clean backtest would need (a) a historical predicted-XI snapshot (Model #4 output as of, say, 2024-12-31) and (b) a historical squads snapshot for the caps denominator. Both are recoverable from git history of web/public/predicted_squads.json and data/wc2026/squads.csv respectively — left as future work.

What we cannot backtest

A true "did this player score in a WC fixture" backtest needs per-match goal-event attribution. We checked:

  • data/raw/wyscout/ — directory empty; events are documented but not on disk.
  • data/raw/intl/results.csv — match-level scores, no scorer attribution.
  • data/wc2026/wyscout_lineups.csv — 115 historical tournament starting XIs (gold standard for the start-prob signal), no goal events.
  • data/wc2026/intl_career_timeline.csv — per-player per-year goals, not per-match.

To run a true backtest the next pass would need to either (a) re-pull the cached data/raw/wyscout/events/<match_id>.json events (per scripts/pull_wyscout_open.py), or (b) ingest the FotMob match-detail scorer data.

Decision

Ship the layered start_prob behind a flag, default OFF. Concretely:

  • scripts/build_anytime_scorer.py --start-prob-source timeline — default; preserves v1 behaviour exactly.
  • scripts/build_anytime_scorer.py --start-prob-source predicted_xi — opts into the new layered chain.
  • scripts/build_anytime_scorer.py --start-prob-source caps — full ablation; reproduces the pre-D5 v0 behaviour for testing.
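
A flag with these three values and a `timeline` default could be declared with argparse roughly as follows. This is a sketch of the interface described above, not the parser in scripts/build_anytime_scorer.py; `build_parser` and the help text are illustrative.

```python
import argparse


def build_parser():
    """Build a parser exposing only the start_prob source selector."""
    parser = argparse.ArgumentParser(description="Anytime-scorer build (sketch)")
    parser.add_argument(
        "--start-prob-source",
        choices=("timeline", "predicted_xi", "caps"),
        default="timeline",  # default-off: v1 behaviour unless opted in
        help="which start_prob chain to use",
    )
    return parser


# No flag given: falls back to the v1 timeline chain.
args = build_parser().parse_args([])
print(args.start_prob_source)

# Explicit opt-in to the layered v2 chain.
opt_in = build_parser().parse_args(["--start-prob-source", "predicted_xi"])
print(opt_in.start_prob_source)
```

Restricting the flag with `choices` means a typo in the source name fails at parse time instead of silently falling through to another chain.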

The output CSV gets a new start_prob_source column attributing each player's layer ("predicted_xi_start", "predicted_xi_bench", "timeline", or "caps"), so a reader of /scorers/ can see which signal drives each headline regardless of the active flag.

The default stays timeline because:

  1. The contaminated backtest does NOT show predicted_xi lift, so flipping the default would ship a regression on the only metric we can measure today.
  2. The predicted-XI band shape (0.85 / 0.10) is more interpretable for a reader (a player's headline reflects a model-of-Model-#4 forecast rather than a coarse career-caps aggregate), but interpretability gains don't justify a measurable Brier regression.
  3. Flipping the flag in a follow-up is one line; rolling back a default change after the CSV ships and feeds downstream pages (/scorers/, predicted_squads.json consumers) takes several lines and risks stale-cache fanout.

Verdict (2026-05-28)

Closing this investigation. The predicted-XI layer does not ship as the default.

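The contaminated backtest (the only one available without substantial historical-snapshot recovery work) shows predicted_xi as the worst of the three arms, though its gap to timeline is only +0.0010 Brier. Because predicted_squads.json was built after the 2025 outcomes occurred, the contaminated version gives the predicted_xi arm more information than a clean historical backtest would, so a clean version is unlikely to produce a better result for predicted_xi. The finding that predicted_xi shows no lift over timeline is therefore robust. The measured ordering is caps > timeline > predicted_xi, but the caps arm's lead rests on a signal that also sees 2025 data and should not be over-read.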

The fundamental limitation is structural, not data-quality: the predicted-XI signal compresses a continuous start-probability into 4 discrete bands (0.85/0.75/0.10/0.15), discarding the intra-bucket ranking that caps provides via hundreds of distinct values. The outcome being measured (scored in any 2025 international fixture) is goal-rate-dominated rather than minutes-dominated, which further reduces the minutes-precision advantage the predicted-XI layer was designed to provide.

The --start-prob-source predicted_xi flag remains available for opt-in experimentation but is not the production default.

Follow-ups (deprioritised)

These are preserved for completeness but are no longer blocking any decision:

  1. Confirmed-XI override — the only path to a real outcome lift. Once a team's actual match-day XI is published (~75 min before kickoff), a code path should set start_prob = 1.0 for the 11 / 0.05 for benched players. This is worth doing during the tournament itself regardless of the predicted-XI result.
  2. Continuous predicted_xi probability — Model #4's rank_in_position field could soften the 4-bucket bands to a continuous score. Moot unless follow-up #1 shows minutes-precision matters more than the current backtest suggests.
  3. Per-match goal events — re-hydrate data/raw/wyscout/events/ to score P(player scores in match M) directly. Would test the right outcome but requires data work and is blocked on Wyscout event access.
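
Follow-up 1 could look roughly like the sketch below, which overrides one team's start_prob values once its match-day XI is known. The function `apply_confirmed_xi`, its arguments and the player labels are hypothetical; the 1.0 and 0.05 values are the ones proposed in follow-up 1.

```python
def apply_confirmed_xi(start_probs, squad, confirmed_xi, bench_prob=0.05):
    """Return a copy of start_probs with one team's squad overridden.

    Players in the confirmed XI get 1.0; the rest of the squad gets bench_prob.
    Players outside this squad keep their existing value.
    """
    xi = set(confirmed_xi)
    if len(xi) != 11:
        raise ValueError("expected 11 distinct starters")
    if not xi <= set(squad):
        raise ValueError("confirmed XI contains players outside the squad")
    updated = dict(start_probs)
    for player in squad:
        updated[player] = 1.0 if player in xi else bench_prob
    return updated


# Toy 13-man squad: P1..P11 start, P12 and P13 are benched.
squad = [f"P{i}" for i in range(1, 14)]
before = {p: 0.85 for p in squad[:11]} | {"P12": 0.10, "P13": 0.10, "Other": 0.40}
after = apply_confirmed_xi(before, squad, squad[:11])

print(after["P1"], after["P12"], after["Other"])
```

Returning a copy keeps the pre-match probabilities available for comparison after the override is applied.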

Compliance check

The change is internal to Model #5: no new UI surface, no exposed market comparison, no betting language in any output. The new CSV column start_prob_source uses research vocabulary. Per CLAUDE.md §12 the change ships clean: no operator names, no edge calculations, no betting vocabulary, no track-record framing.