Status: Not shipped. Negative result (2026-05-29). A point-in-time
player-form differential offset Δ = α·(form_home − form_away) does
not beat the no-offset baseline at any tested α — every nonzero α
slightly worsens median Brier. No production artefact was changed; the
form score remains a display-only feature (PR #659).
Backtest date: 2026-05-29
Reproducer: scripts/backtest_form_offset.py --folds 8
Build: scripts/build_player_form.py (form_delta_by_player_season)
Tests: tests/test_build_player_form.py, tests/test_backtest_form_offset.py
Sidecar (gitignored): data/wc2026/form_offset_backtest.json
Follows: documentation/research-notes/player-strength-fitted-alpha.md,
documentation/research-notes/composite-coverage-backfill.md
Hypothesis under test
The current-form score (scripts/build_player_form.py) measures
momentum — a player's most recent full club season vs their own
multi-season baseline — deliberately orthogonal to the composite rating
(which is the level). The claim: a team's XI form, differenced between
the two sides, carries short-term predictive signal the level-based
composite offset misses, so Δ = α·(form_home − form_away) on the DC +
HP Poisson log-rates should lower Brier without hurting calibration.
Verdict: it does not. The gate fails cleanly; the offset is inert-to-harmful.
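As a minimal sketch of how such an offset could enter a two-sided Poisson model, the function below shifts the home and away log-rates by Δ. The symmetric split (+Δ/2 on the home side, −Δ/2 on the away side), the function name apply_form_offset, and the input values are assumptions made for illustration; this note does not state how the backtest distributes Δ between the two sides.

```python
import math


def apply_form_offset(log_rate_home, log_rate_away, form_home, form_away, alpha):
    """Shift Poisson log-rates by the form differential.

    Assumption (not taken from the backtest code): delta is split
    symmetrically, +delta/2 on the home log-rate and -delta/2 on the away one.
    """
    delta = alpha * (form_home - form_away)
    return log_rate_home + delta / 2.0, log_rate_away - delta / 2.0


# Hypothetical inputs: base rates of 1.5 and 1.1 goals, home XI in better form.
home, away = apply_form_offset(math.log(1.5), math.log(1.1), 0.30, -0.10, alpha=0.10)
adjusted_rates = (math.exp(home), math.exp(away))
```

At α = 0 the function returns the log-rates unchanged, which is the baseline row of the results table.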
The hard correctness constraint: point-in-time form
Form is time-varying by construction, so — unlike the composite offset, which applies one current team snapshot to all historical matches — a current form snapshot cannot be used in a backtest: it would leak 2025/26 performance into 2023–2024 holdout matches. Any Brier gain from that would be a lookahead artefact (cf. the gk-offset "+44bp was fake" lesson).
So the backtest uses form_delta_by_player_season: for each match, each
player's form is computed only from club seasons that had completed
before that match (season_cutoff() maps a match date to the most
recent finished season). Squad membership is the current XI — the same
mild anachronism the composite offset already accepts; reconstructing
historical national-team XIs would need per-date squad snapshots we don't
have. Only the form values are point-in-time, and a no-lookahead unit
test pins that the season-S value never changes when a future season is
added.
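The no-lookahead property can be illustrated with a small sketch. Everything here is hypothetical: season_cutoff_sketch stands in for the repo's season_cutoff() (whose actual rule this note does not give) and assumes a season labelled by its end year counts as finished from 1 July of that year; the delta is taken as a season's rating minus the mean of earlier seasons, which is only one possible reading of "recent season vs own baseline".

```python
from datetime import date


def season_cutoff_sketch(match_date):
    """End year of the most recent season finished before match_date.

    Assumption: season Y (e.g. 2024 for 2023/24) is finished from 1 July of Y.
    """
    if match_date >= date(match_date.year, 7, 1):
        return match_date.year
    return match_date.year - 1


def form_delta_by_season(ratings):
    """Map {season: rating} to {season: rating minus mean of earlier seasons}.

    Each delta reads only its own season and earlier ones, so adding a later
    season cannot change it. The first season has no baseline and is skipped.
    """
    deltas = {}
    seasons = sorted(ratings)
    for i, season in enumerate(seasons[1:], start=1):
        baseline = sum(ratings[s] for s in seasons[:i]) / i
        deltas[season] = ratings[season] - baseline
    return deltas


def form_as_of(deltas, match_date):
    """Latest delta from a season completed before match_date, else None."""
    cutoff = season_cutoff_sketch(match_date)
    eligible = [s for s in deltas if s <= cutoff]
    return deltas[max(eligible)] if eligible else None
```

Under these assumptions a match on 2024-03-10 sees the 2022/23 value at most, and appending a 2024/25 rating leaves every earlier delta untouched, which is the invariant the unit test pins.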
Result — 8×90-day walk-forward, DC+HP uniform average
| α | median Brier | median ECE |
|---|---|---|
| 0.00 (baseline) | 0.51402 | 0.07406 |
| 0.05 | 0.51414 | 0.07314 |
| 0.10 | 0.51426 | 0.07417 |
| 0.20 | 0.51451 | 0.07424 |
| 0.40 | 0.51506 | 0.07370 |
Median Brier increases monotonically in α, so the best candidate is α = 0 and the conjunction gate (median Brier strictly down AND median ECE within +0.2pp) fails on its first condition. Median ECE stays within tolerance at every α but moves without a trend, so it offers no compensating gain. The form signal does not improve the forecast; pushed harder, it degrades it.
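The gate check can be replayed against the table above with a few lines. This is a sketch, not the backtest's own code: the function name passes_gate is hypothetical, and the tolerance of 0.002 assumes ECE is stored as a fraction so that +0.2pp equals 0.002.

```python
# (alpha, median Brier, median ECE) copied from the results table.
RESULTS = [
    (0.00, 0.51402, 0.07406),
    (0.05, 0.51414, 0.07314),
    (0.10, 0.51426, 0.07417),
    (0.20, 0.51451, 0.07424),
    (0.40, 0.51506, 0.07370),
]
ECE_TOLERANCE = 0.002  # assumption: +0.2pp expressed as a fraction


def passes_gate(brier, ece, base_brier, base_ece, ece_tol=ECE_TOLERANCE):
    """Conjunction gate: Brier strictly lower AND ECE within base + tolerance."""
    return brier < base_brier and ece <= base_ece + ece_tol


_, base_brier, base_ece = RESULTS[0]

# Gate verdict per nonzero alpha.
verdicts = {
    alpha: passes_gate(brier, ece, base_brier, base_ece)
    for alpha, brier, ece in RESULTS[1:]
}

# Split the two conditions to see which one is responsible for the failure.
brier_ok = {alpha: brier < base_brier for alpha, brier, _ in RESULTS[1:]}
ece_ok = {alpha: ece <= base_ece + ECE_TOLERANCE for alpha, _, ece in RESULTS[1:]}
```

Separating the two conditions shows that the ECE condition holds at every α while the Brier condition holds at none.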
Why: not enough data where it counts (coverage, not coefficient)
This is the same wall the composite-coverage note hit, only sharper. The offset can only fire on a match when both teams have an XI form value as of that date. Form is built from Big-5 club data only (≈24% of squad players; nine WC squads have zero coverage), so across the holdout the offset fired on a small minority of matches per walk: roughly 6/89 to 40/501 (~7–8%). The remaining majority get an identical prediction at every α, so α is effectively judged on a small, lumpy subset.
Two compounding shortages drive this:
- No per-player international data. intl_match_xg is team-level, so form must be derived from club seasons, an indirect link to international results.
- The club data is Big-5-only. Most international fixtures involve at least one team with no covered XI, so the offset never engages.
The consequence is not merely "low impact" — it is underpowered: the
deciding subset is too small and noisy for a real effect to clear the
median-of-8 gate, and a "pass" here would more likely be an artefact than
a signal. Combined with the well-documented tendency of form to revert
quickly beyond a few matches (documentation/player-quality.md), the
honest reading is that the current corpus cannot resolve a form effect,
if one exists.
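The firing rule described in this section is simple to state in code. The sketch below uses hypothetical team labels and form values; only the two walk-level counts (6/89 and 40/501) come from this note.

```python
def offset_fires(home, away, team_form):
    """The offset engages only when both sides have an XI form value."""
    return team_form.get(home) is not None and team_form.get(away) is not None


def firing_rate(fixtures, team_form):
    """Share of fixtures on which the offset can change the prediction."""
    fired = sum(offset_fires(home, away, team_form) for home, away in fixtures)
    return fired / len(fixtures)


# Hypothetical coverage: A and B have form, C has none as of the match date,
# D is absent from the form table entirely.
team_form = {"A": 0.2, "B": -0.1, "C": None}
fixtures = [("A", "B"), ("A", "C"), ("C", "D"), ("B", "A")]
rate = firing_rate(fixtures, team_form)

# Firing shares implied by the walk-level counts quoted in this section.
quoted_pct = {
    "6/89": round(100 * 6 / 89, 1),
    "40/501": round(100 * 40 / 501, 1),
}
```

Every fixture where offset_fires is False receives the same prediction at all α, so only the fired subset can separate one α from another.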
What would change the answer
- A non-Big-5 per-player source (FotMob / soccerdata, per the pull_fotmob_team_season_stats.py direction) to extend form coverage past the Big-5 so the offset fires on a representative share of fixtures.
- Historical squad snapshots for true point-in-time team membership, removing the current-XI anachronism.
Until both land, more model sophistication on this axis keeps failing the gate — the corpus ceiling, again. The form score stays a display feature; the offset is documented here as a no-ship and left disabled.
Follow-up (2026-05-30): the coverage ceiling, removed — still a no-ship
The "what would change the answer" prescription above was tested directly. A
non-Big-5 per-player source now exists — scripts/pull_fotmob_player_form.py
pulls FotMob international-competition rating (qualifiers + continental cups +
World Cups, all six confederations) into data/wc2026/player_form_fotmob.csv,
joined to player_id by a name+country crosswalk. build_player_form_intl.py
turns it into the same point-in-time {player_id: {season: delta}} structure
(recent-vs-baseline EWMA on rating), and backtest_form_offset.py gained a
--form-source {club,intl,merged} switch. The coverage ceiling is gone:
teams with point-in-time form rose 41 → 48 (all WC nations), and the
offset's per-match firing rose from ~7–8% to a ~13% median (peak 21%).
It still does not survive. On the same conjunction gate the intl signal
appears to pass (intl-only improves Brier by 27.1bp over 8 folds and 17.5bp
over 16; club stays α=0/FAIL). But a placebo test shows the "pass" is not
real. scripts/validate_form_offset.py fits the models once, then re-evaluates
the gate against the same form timeseries shuffled across teams, which breaks
team identity while preserving the distribution:
| | 8 folds | 16 folds |
|---|---|---|
| Real intl Δ Brier | −27.1 bp | −17.5 bp |
| Real ECE Δ | −0.30 pp | +0.09 pp |
| Walks the offset helped | 3 / 8 | 9 / 16 |
| Placebo gate-pass rate | 42 % | 46 % |
| Empirical p (placebos ≥ real) | 0.14 | 0.08 |
Random team↔form assignments clear the gate ~45% of the time, and the real effect sits at p ≈ 0.08–0.14 — not significant, point estimate unstable across fold counts, ECE sign flips, and only about half the walks help. More holdout (8 → 16 folds) did not tighten it.
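The shuffled-form check can be sketched as a permutation test. This is an illustration under stated assumptions, not the contents of validate_form_offset.py: the function name placebo_p_value, the evaluate callback (a stand-in for re-evaluating the gate statistic with the fitted models held fixed), and the toy data are all hypothetical, and the p-value is taken as the plain share of placebos at least as good as the real assignment.

```python
import random


def placebo_p_value(real_stat, team_form, evaluate, n_placebos=200, seed=0):
    """Empirical p: share of shuffled-form runs at least as good as the real one.

    team_form: {team: form value or series}.
    evaluate:  callable mapping such a dict to a gain statistic where larger
               is better (for example, Brier improvement over the baseline).
    """
    rng = random.Random(seed)
    teams = list(team_form)
    series = [team_form[t] for t in teams]
    hits = 0
    for _ in range(n_placebos):
        shuffled = series[:]
        rng.shuffle(shuffled)  # breaks team identity, keeps the distribution
        if evaluate(dict(zip(teams, shuffled))) >= real_stat:
            hits += 1
    return hits / n_placebos


# Toy check with hypothetical data: form perfectly aligned with strength
# should almost never be matched by a random reassignment.
strength = {f"T{i}": float(i) for i in range(8)}
toy_form = {t: 0.1 * s for t, s in strength.items()}


def toy_gain(form):
    return sum(form[t] * strength[t] for t in form)


p_value = placebo_p_value(toy_gain(toy_form), toy_form, toy_gain)
```

A real signal should behave like the toy case and yield a small p. The values reported in the table (0.14 and 0.08) mean a substantial share of random assignments did at least as well as the real one.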
Two conclusions:
- Coverage was necessary but not the binding constraint. Removing the Big-5 ceiling moved the offset from "α = 0 optimal" to "marginal but insignificant", not to a win. The form signal is intrinsically weak (it reverts fast and team strength is already priced by DC/HP/Elo). This refines the original "corpus ceiling" reading: corpus power still binds, but adding form coverage does not unlock a real effect, so this axis is closed. (Cf. composite-coverage-backfill.md: coverage was not the lever there either.)
- The conjunction gate is too permissive at this holdout size. "Median Brier strictly lower at α = 0.05" is ≈ a coin flip, so the gate alone passes ~45% of noise, the same failure mode that produced the fake gk-offset "+44bp". A placebo / shuffled-form significance check (validate_form_offset.py) is now the standard guard for any offset gate; see documentation/methodology.md ("Offset-gate placebo guard").
The FotMob form pull is retained as a display feature (current form for non-Big-5 players, on surfaces that show no form today), carrying an explicit "international form" basis label and no model claim, the same descriptive-only treatment as squad cohesion. The offset remains a no-ship.