Research note

Does a player-form (momentum) offset improve match forecasts? (No)



Status: Not shipped. Negative result (2026-05-29). A point-in-time player-form differential offset Δ = α·(form_home − form_away) does not beat the no-offset baseline at any tested α — every nonzero α slightly worsens median Brier. No production artefact was changed; the form score remains a display-only feature (PR #659).

Backtest date: 2026-05-29
Reproducer: scripts/backtest_form_offset.py --folds 8
Build: scripts/build_player_form.py (form_delta_by_player_season)
Tests: tests/test_build_player_form.py, tests/test_backtest_form_offset.py
Sidecar (gitignored): data/wc2026/form_offset_backtest.json
Follows: documentation/research-notes/player-strength-fitted-alpha.md, documentation/research-notes/composite-coverage-backfill.md

Hypothesis under test

The current-form score (scripts/build_player_form.py) measures momentum — a player's most recent full club season vs their own multi-season baseline — deliberately orthogonal to the composite rating (which is the level). The claim: a team's XI form, differenced between the two sides, carries short-term predictive signal the level-based composite offset misses, so Δ = α·(form_home − form_away) on the DC + HP Poisson log-rates should lower Brier without hurting calibration.
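As a minimal sketch of the mechanism under test, the snippet below applies Δ to a pair of Poisson scoring rates and recomputes the match-outcome probabilities. It assumes a symmetric application (+Δ on the home log-rate, −Δ on the away log-rate) and independent Poisson goals, which simplifies whatever the DC and HP models actually do; neither detail is confirmed by this note. apply_form_offset and outcome_probs are hypothetical names, not functions from the repository.

```python
import math

def apply_form_offset(lam_home, lam_away, form_home, form_away, alpha):
    """Shift the two log-rates by the form differential.

    Assumption: the offset is applied symmetrically (+delta home, -delta away).
    """
    delta = alpha * (form_home - form_away)
    return lam_home * math.exp(delta), lam_away * math.exp(-delta)

def poisson_pmf(k, lam):
    """Probability of exactly k goals at scoring rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def outcome_probs(lam_home, lam_away, max_goals=10):
    """Return (P(home win), P(draw), P(away win)) for independent Poisson goals."""
    home = draw = away = 0.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = poisson_pmf(h, lam_home) * poisson_pmf(a, lam_away)
            if h > a:
                home += p
            elif h == a:
                draw += p
            else:
                away += p
    total = home + draw + away  # renormalise the truncated score grid
    return home / total, draw / total, away / total

# Illustrative numbers only (not taken from the backtest).
base = outcome_probs(1.5, 1.1)
lam_h, lam_a = apply_form_offset(1.5, 1.1, form_home=0.3, form_away=-0.2, alpha=0.4)
shifted = outcome_probs(lam_h, lam_a)
```

With α = 0 the rates are untouched, which is why the baseline row in the results is the no-offset model.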

Verdict: it does not. The gate fails cleanly; the offset is inert-to-harmful.

The hard correctness constraint: point-in-time form

Form is time-varying by construction, so — unlike the composite offset, which applies one current team snapshot to all historical matches — a current form snapshot cannot be used in a backtest: it would leak 2025/26 performance into 2023–2024 holdout matches. Any Brier gain from that would be a lookahead artefact (cf. the gk-offset "+44bp was fake" lesson).

So the backtest uses form_delta_by_player_season: for each match, each player's form is computed only from club seasons that had completed before that match (season_cutoff() maps a match date to the most recent finished season). Squad membership is the current XI — the same mild anachronism the composite offset already accepts; reconstructing historical national-team XIs would need per-date squad snapshots we don't have. Only the form values are point-in-time, and a no-lookahead unit test pins that the season-S value never changes when a future season is added.
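The point-in-time lookup can be sketched as follows. This is an illustration under stated assumptions, not the repository's season_cutoff(): it assumes seasons are labelled by their end year, that a club season counts as finished after 30 June, and that the lookup takes the latest eligible season. season_cutoff_sketch and form_as_of are hypothetical names.

```python
from datetime import date
from typing import Dict, Optional

# Assumption: a club season labelled by end year Y is finished after 30 June of Y.
SEASON_END_MONTH_DAY = (6, 30)

def season_cutoff_sketch(match_date: date) -> int:
    """Return the end year of the most recent season finished before match_date."""
    end_this_year = date(match_date.year, *SEASON_END_MONTH_DAY)
    if match_date > end_this_year:
        return match_date.year
    return match_date.year - 1

def form_as_of(form_by_season: Dict[int, float], match_date: date) -> Optional[float]:
    """Latest form delta from a season completed before the match, else None."""
    cutoff = season_cutoff_sketch(match_date)
    eligible = [season for season in form_by_season if season <= cutoff]
    if not eligible:
        return None
    return form_by_season[max(eligible)]

# Toy player: form deltas for the seasons ending in 2023 and 2024.
player_form = {2023: 0.10, 2024: -0.05}
before = form_as_of(player_form, date(2024, 3, 1))   # only the 2023 season is visible

# No-lookahead check: adding a later season must not change the earlier answer.
player_form[2025] = 0.90
after = form_as_of(player_form, date(2024, 3, 1))
```

The last three lines mirror the spirit of the no-lookahead unit test: a value seen by a March 2024 match is the same before and after a future season is appended.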

Result — 8×90-day walk-forward, DC+HP uniform average

α                 median Brier   median ECE
0.00 (baseline)   0.51402        0.07406
0.05              0.51414        0.07314
0.10              0.51426        0.07417
0.20              0.51451        0.07424
0.40              0.51506        0.07370

Median Brier increases monotonically in α, so the best candidate is α = 0 and the conjunction gate (median Brier strictly down AND median ECE within +0.2pp) fails on its first condition. The form signal does not improve the forecast; pushed harder, it degrades it.
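The gate check can be replayed on the reported medians with a few lines. This is a sketch, not the code in backtest_form_offset.py: WalkSummary and gate_passes are hypothetical names, and "+0.2pp" is read here as 0.002 in ECE units, which is an assumption.

```python
from typing import NamedTuple

class WalkSummary(NamedTuple):
    alpha: float
    median_brier: float
    median_ece: float

# Assumption: "+0.2pp" means the candidate ECE may exceed baseline by at most 0.002.
ECE_TOLERANCE = 0.002

def gate_passes(baseline, candidate, ece_tol=ECE_TOLERANCE):
    """Conjunction gate: Brier strictly lower AND ECE within tolerance."""
    brier_down = candidate.median_brier < baseline.median_brier
    ece_ok = candidate.median_ece <= baseline.median_ece + ece_tol
    return brier_down and ece_ok

# Medians as reported in the results table of this note.
results = [
    WalkSummary(0.00, 0.51402, 0.07406),
    WalkSummary(0.05, 0.51414, 0.07314),
    WalkSummary(0.10, 0.51426, 0.07417),
    WalkSummary(0.20, 0.51451, 0.07424),
    WalkSummary(0.40, 0.51506, 0.07370),
]
baseline, candidates = results[0], results[1:]

passing_alphas = [c.alpha for c in candidates if gate_passes(baseline, c)]
ece_within_tol = all(c.median_ece <= baseline.median_ece + ECE_TOLERANCE for c in candidates)
```

On these numbers passing_alphas is empty while ece_within_tol is true, so under this reading of the tolerance the failure comes entirely from the Brier condition.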

Why: not enough data where it counts (coverage, not coefficient)

This is the same wall the composite-coverage note hit, only sharper. The offset can only fire on a match when both teams have an XI form value as of that date. Form is built from Big-5 club data only (≈24% of squad players; nine WC squads have zero coverage), so across the holdout the offset fired on a minority of matches per walk: roughly 6/89 to 40/501 (~7–8%). The remaining majority gets an identical prediction at every α, so α is effectively judged on a small, lumpy subset.
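The firing condition is simple enough to state in code. The sketch below uses toy data and hypothetical names (offset_fires, firing_rate); it assumes a team without coverage is represented either by a missing key or by None.

```python
from typing import Dict, Iterable, Optional, Tuple

def offset_fires(home: str, away: str, team_form: Dict[str, Optional[float]]) -> bool:
    """The offset engages only when both sides have a point-in-time form value."""
    return team_form.get(home) is not None and team_form.get(away) is not None

def firing_rate(matches: Iterable[Tuple[str, str]],
                team_form: Dict[str, Optional[float]]) -> float:
    """Share of matches on which the offset can change the prediction."""
    matches = list(matches)
    if not matches:
        return 0.0
    fired = sum(offset_fires(home, away, team_form) for home, away in matches)
    return fired / len(matches)

# Toy data, not from the backtest: team C has no covered XI, team D is absent.
team_form = {"A": 0.12, "B": -0.04, "C": None}
matches = [("A", "B"), ("A", "C"), ("C", "D"), ("B", "A")]
rate = firing_rate(matches, team_form)  # 2 of the 4 toy matches fire

# The per-walk counts quoted in the text, expressed as shares.
low_share, high_share = 6 / 89, 40 / 501  # about 0.067 and 0.080
```

Every match where offset_fires is false receives the same prediction at every α, so only the firing subset contributes to any difference between rows of the results table.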

Two compounding shortages drive this:

  1. No per-player international data. intl_match_xg is team-level, so form must be derived from club seasons — an indirect link to international results.
  2. The club data is Big-5-only. Most international fixtures involve at least one team with no covered XI, so the offset never engages.

The consequence is not merely "low impact" — it is underpowered: the deciding subset is too small and noisy for a real effect to clear the median-of-8 gate, and a "pass" here would more likely be an artefact than a signal. Combined with the well-documented tendency of form to revert quickly beyond a few matches (documentation/player-quality.md), the honest reading is that the current corpus cannot resolve a form effect, if one exists.

What would change the answer

  • A non-Big-5 per-player source (FotMob / soccerdata, per the pull_fotmob_team_season_stats.py direction) to extend form coverage past the Big-5 so the offset fires on a representative share of fixtures.
  • Historical squad snapshots for true point-in-time team membership, removing the current-XI anachronism.

Until both land, more model sophistication on this axis keeps failing the gate — the corpus ceiling, again. The form score stays a display feature; the offset is documented here as a no-ship and left disabled.

Follow-up (2026-05-30): the coverage ceiling, removed — still a no-ship

The "what would change the answer" prescription above was tested directly. A non-Big-5 per-player source now exists — scripts/pull_fotmob_player_form.py pulls FotMob international-competition rating (qualifiers + continental cups + World Cups, all six confederations) into data/wc2026/player_form_fotmob.csv, joined to player_id by a name+country crosswalk. build_player_form_intl.py turns it into the same point-in-time {player_id: {season: delta}} structure (recent-vs-baseline EWMA on rating), and backtest_form_offset.py gained a --form-source {club,intl,merged} switch. The coverage ceiling is gone: teams with point-in-time form rose 41 → 48 (all WC nations), and the offset's per-match firing rose from ~7–8% to a ~13% median (peak 21%).

It still does not survive. On the same conjunction gate the intl signal appears to pass (intl-only +27.1bp 8-fold / +17.5bp 16-fold; club stays α=0/FAIL). But a placebo test (scripts/validate_form_offset.py: fit the models once, then re-evaluate the gate against the same form timeseries shuffled across teams — identity broken, distribution preserved) shows the "pass" is not real:

Metric                           8 folds     16 folds
Real intl Δ Brier                −27.1 bp    −17.5 bp
Real ECE Δ                       −0.30 pp    +0.09 pp
Walks the offset helped          3 / 8       9 / 16
Placebo gate-pass rate           42 %        46 %
Empirical p (placebos ≥ real)    0.14        0.08

Random team↔form assignments clear the gate ~45% of the time, and the real effect sits at p ≈ 0.08–0.14 — not significant, point estimate unstable across fold counts, ECE sign flips, and only about half the walks help. More holdout (8 → 16 folds) did not tighten it.
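The placebo procedure can be sketched as below. It is an illustration of the shuffled-form idea, not the contents of validate_form_offset.py: all function names are hypothetical, evaluate_gain stands in for "re-evaluate the already-fitted models with this team-to-form mapping and return the Brier gain", and the plain count/n p-value is an assumption (a +1-smoothed variant is also common).

```python
import random
from typing import Callable, Dict, List, Sequence

FormSeries = Sequence[float]

def shuffle_form_across_teams(team_form: Dict[str, FormSeries],
                              rng: random.Random) -> Dict[str, FormSeries]:
    """Reassign whole form timeseries to random teams.

    Team identity is broken, but the set of timeseries (the distribution) is kept.
    """
    teams = list(team_form)
    series = [team_form[team] for team in teams]
    rng.shuffle(series)
    return dict(zip(teams, series))

def run_placebos(team_form: Dict[str, FormSeries],
                 evaluate_gain: Callable[[Dict[str, FormSeries]], float],
                 n_placebos: int,
                 seed: int = 0) -> List[float]:
    """Evaluate the (already fitted) models against n shuffled form assignments."""
    rng = random.Random(seed)
    return [evaluate_gain(shuffle_form_across_teams(team_form, rng))
            for _ in range(n_placebos)]

def empirical_p(real_gain: float, placebo_gains: Sequence[float]) -> float:
    """Share of placebo runs whose gain is at least as large as the real one."""
    return sum(gain >= real_gain for gain in placebo_gains) / len(placebo_gains)

# Toy stand-in: the "gain" is just the first form value assigned to team A.
toy_form = {"A": [1.0], "B": [2.0], "C": [3.0]}
toy_gains = run_placebos(toy_form, lambda mapping: mapping["A"][0], n_placebos=50)
toy_p = empirical_p(2.0, [1.0, 2.0, 3.0, 0.5])  # two of four placebos reach 2.0
```

Because the models are fitted once and only the form assignment changes, the placebo distribution isolates how often the gate would pass on form values that carry no team-specific information.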

Two conclusions:

  1. Coverage was necessary but not the binding constraint. Removing the Big-5 ceiling moved the offset from "α = 0 optimal" to "marginal but insignificant" — not to a win. The form signal is intrinsically weak (it reverts fast and team strength is already priced by DC/HP/Elo). This refines the original "corpus ceiling" reading: corpus power still binds, but adding form coverage does not unlock a real effect, so this axis is closed. (Cf. composite-coverage-backfill.md — coverage was not the lever there either.)

  2. The conjunction gate is too permissive at this holdout size. "Median Brier strictly lower at α = 0.05" is roughly a coin flip, so the gate alone passes ~45% of noise, the same failure mode that produced the fake gk-offset "+44bp". A placebo / shuffled-form significance check (validate_form_offset.py) is now the standard guard for any offset gate; see documentation/methodology.md ("Offset-gate placebo guard").

The FotMob form pull is retained as a display feature (current form for non-Big-5 players, on surfaces that show no form today), carrying an explicit "international form" basis label and no model claim, the same descriptive-only treatment as squad cohesion. The offset remains a no-ship.