Research note

Can we model the game *script*? Design for a within-match game-state layer

Status: Design only. No code written, no fit run, no decision taken.
Author date: 2026-06-02
Companion code (current): scripts/fit_dixon_coles.py, scripts/ensemble.py, scripts/backtest_composite_offset.py (walk-forward + acceptance-gate harness template), scripts/build_team_style_vectors.py (StatsBomb event-feature extraction we'd reuse).

Why this exists

Our forecasting stack — ClubElo / FIFA-Elo + Dixon-Coles + Hierarchical Poisson, plus the small shipped composite-α and GK-α offsets — predicts a match from each team's long-run strength. It has no representation of how a single game unfolds once it kicks off. Every minute of a match is exchangeable to it; a 1–0 lead at the 6th minute and a 1–0 lead at the 85th carry identical information.

The 2025-26 Champions League final was a clean illustration of what that misses. Our pre-match baseline made Arsenal the stronger side (1.73 expected goals to PSG's 0.98). The match inverted that: PSG created 1.77 xG to Arsenal's 0.48, took 21 shots to 7, and had ~75% possession, yet it still finished 1–1, because Arsenal scored early and defended a deep block for 110 minutes. The team that "should" have created more was out-created by more than three to one on xG and three to one on shots, and an aggregate-strength model cannot see why: the cause was a game script (early goal → low block → territory without the second goal), and game scripts live inside the match, not in the season-long team parameters.

This note asks whether a within-match game-state layer — scoring intensity conditional on the current scoreline and time remaining — is worth building, and proposes how we would test it honestly. It does not propose building it yet; like the state-space and style-matchup notes before it, the first artifact is a falsifiable proposal.

Hypothesis

The instantaneous goal-scoring rate in a match is not constant at the Dixon-Coles team means (λ_home, λ_away). It is modulated by game state — the current goal difference and the time elapsed. Specifically:

  • A team that leads tends to reduce its own scoring rate and concede territory (the "protect the lead" effect), more strongly as time runs out and more strongly the larger the lead.
  • A trailing team raises its attacking rate (the "chase" effect), again intensifying with time.
  • These are state-transition effects, not team-identity effects: they apply on top of whatever (λ_home, λ_away) the team-level fit already produced.

If these effects are real and estimable, a forecast that integrates over the within-match state path should be better calibrated on the outcome distribution — particularly on the draw and narrow-margin probabilities, which is exactly where a lead-protection effect bites and exactly where the UCL final landed (modal scoreline 1–1, correct; headline winner, wrong).

Why this is distinct from what we already tried (and what failed)

This is the important part. Three adjacent directions have already been built and backtested in this repo, and all three were negative or tiny. The game-state hypothesis is only worth raising because it is genuinely orthogonal to all three — it targets a signal none of them modelled.

  1. State-space Dixon-Coles (state-space-dc-result.md, NOT SHIPPED, Brier +31bp at best). That work let team parameters drift between matches across seasons (2018 Argentina ≠ 2022 Argentina). It is a between-match time-variation model. Game state is a within-match effect — orthogonal axis. State-space failed on corpus sparsity (~22 matches/team/walk); the game-state layer does not add per-team parameters (see "Parameterisation" below), so it does not inherit that specific sparsity failure mode.

  2. Style-matchup model (style-matchup-fit.md, NOT SHIPPED, Brier flat-to-+33bp). That fit static per-style-pair offsets (possession vs transition, etc.) — a pre-match team-level descriptor. It failed because the per-style-pair residual was drowned by noise (~16 matches/cell) and because the shipped composite-α / GK-α offsets already absorb team-level signal. Game state is not a team descriptor at all; it is a within-match dynamic that the style work never touched.

  3. xG-as-response (methodology.md §2, NEGATIVE gate). Substituting match xG for goals failed because only ~143 of ~5,000 matches carry xG. This is the most relevant precedent for our data risk (below), because the only within-match data we have is the same StatsBomb open-data corpus.

The shipped player offsets (composite-α −2.29bp, GK-α −1.16bp) are team-aggregated difficulty adjustments — also orthogonal. Nothing in the stack models the in-match state transition. That gap is the whole idea.

Parameterisation (kept deliberately cheap)

The sparsity precedents dictate the design: do not add per-team or per-style parameters. Estimate a single, global set of game-state multipliers shared across all teams — the cheapest possible test of the hypothesis, mirroring how the state-space note tested its idea with the minimal EMA variant first.

Proposed form. Let the instantaneous scoring rate for the team in question be

λ(t, d) = λ_base · exp( γ_lead · f_lead(d, t) + γ_chase · f_chase(d, t) )

where d is that team's current goal difference (lead positive), t is minutes elapsed, λ_base is the team's Dixon-Coles mean rate, and f_lead / f_chase are simple monotone shape functions (e.g. d · t/90 clamped) with two global coefficients (γ_lead, γ_chase). The hypothesis above predicts γ_lead < 0 (a leading team scores less) and γ_chase > 0 (a trailing team scores more); setting γ_lead = γ_chase = 0 recovers the constant rate λ_base. Two parameters total, not per team.

The match outcome distribution is obtained by integrating the inhomogeneous-Poisson state path (closed-form over the discretised minute grid, or a fast Monte-Carlo over the joint process) and comparing to the stationary Dixon-Coles grid.
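
The proposed form can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the planned implementation. The shape functions take one reading of "d · t/90 clamped": lead or deficit size times the clamped fraction of the match elapsed. The base rates are the pre-match figures quoted earlier in this note, and the γ values are placeholders, not estimates. The sketch uses a one-minute forward recursion with at most one goal per minute, and it leaves out the Dixon-Coles low-score dependence correction. All function names are hypothetical.

```python
import math
from collections import defaultdict

HORIZON = 90.0  # regulation minutes (assumption: stoppage and extra time ignored)


def f_lead(d, t):
    """Assumed shape: size of the lead times the clamped fraction of match elapsed."""
    return max(d, 0) * min(max(t / HORIZON, 0.0), 1.0)


def f_chase(d, t):
    """Assumed shape: size of the deficit times the clamped fraction of match elapsed."""
    return max(-d, 0) * min(max(t / HORIZON, 0.0), 1.0)


def rate(lam_base, d, t, gamma_lead, gamma_chase):
    """State-dependent scoring rate, in goals per 90 minutes."""
    return lam_base * math.exp(gamma_lead * f_lead(d, t) + gamma_chase * f_chase(d, t))


def outcome_probs(lam_home, lam_away, gamma_lead, gamma_chase, minutes=90):
    """Push the scoreline distribution forward one minute at a time.

    Approximation: at most one goal per minute. Returns (home, draw, away).
    """
    state = {(0, 0): 1.0}
    for minute in range(minutes):
        nxt = defaultdict(float)
        for (h, a), p in state.items():
            r_home = rate(lam_home, h - a, minute, gamma_lead, gamma_chase) / minutes
            r_away = rate(lam_away, a - h, minute, gamma_lead, gamma_chase) / minutes
            total = r_home + r_away
            p_goal = 1.0 - math.exp(-total)  # chance of any goal in this minute
            nxt[(h, a)] += p * (1.0 - p_goal)
            if total > 0.0:
                # Split the goal probability between the sides in proportion to rate.
                nxt[(h + 1, a)] += p * p_goal * r_home / total
                nxt[(h, a + 1)] += p * p_goal * r_away / total
        state = nxt
    home = sum(p for (h, a), p in state.items() if h > a)
    draw = sum(p for (h, a), p in state.items() if h == a)
    away = sum(p for (h, a), p in state.items() if h < a)
    return home, draw, away


# Illustrative gammas only; the real values would come from the global fit.
stationary = outcome_probs(1.73, 0.98, 0.0, 0.0)
scripted = outcome_probs(1.73, 0.98, -0.3, 0.3)
print("stationary H/D/A:", [round(x, 3) for x in stationary])
print("scripted   H/D/A:", [round(x, 3) for x in scripted])
```

With both coefficients at zero the recursion reduces to two independent constant-rate processes, which gives a like-for-like reference when comparing against the stationary grid.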

Two global parameters fit on hundreds of matches is a completely different sparsity regime from per-team drift (~22 matches/team) or per-style-pair cells (~16 matches/cell). That is the design's central bet: the effect is shared enough across teams to be estimable where the team-specific effects were not.

The data problem (stated plainly, up front)

This is where the idea is most at risk, and the precedent is discouraging.

To fit game-state effects we need within-match event timing — at minimum, goal timestamps; ideally shot-level events with minutes. What we actually have:

  • International goal timing: essentially none on disk. xg.csv is match aggregates only. wyscout_lineups.csv has starters but no goal events, no scorer attribution, no minutes. There is no goal-timestamp table for the international corpus.
  • StatsBomb open-data event timelines: ~6 tournaments. The same source the style vectors use (WC 2018/2022, Euro 2020/2024, Copa 2024, AFCON 2023) carries shot and goal events with minutes. That is the only within-match grounding available — and it is ~6 tournaments, a few hundred matches, the same thin corpus that sank xG-as-response.
  • Club data: ClubElo gives ratings but no event timelines; the UCL live-tracker pulled events from API-Football for one match.

So the honest position is: the modelling is cheap (two global parameters), but the fitting corpus is the binding constraint, and the most similar precedent (xG-as-response on the same StatsBomb corpus) returned a negative gate purely on sample size. A companion data-feasibility probe (does post-first-goal scoring intensity measurably shift, and across how many matches with usable timing?) should run before any model is built — quantify the prior the way the state-space note wishes it had.
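
One way the feasibility probe could be framed is as a rate table: goals per team-minute of exposure, split by whether the scoring team was leading, level, or trailing at the time. The sketch below assumes goal events arrive as (minute, side) pairs per match. That input format, the function names, and the toy data are all hypothetical; nothing here reads the StatsBomb files.

```python
from collections import defaultdict


def label(team_diff):
    """Name the game state from one team's point of view."""
    if team_diff > 0:
        return "leading"
    if team_diff < 0:
        return "trailing"
    return "level"


def state_rates(matches, match_minutes=90):
    """Goals per team-minute in each game state.

    matches: iterable of goal-event lists, each event a (minute, side) pair
    with side in {"home", "away"}. Returns {state: (rate, goals, team_minutes)}.
    """
    goals = defaultdict(int)
    exposure = defaultdict(float)
    for events in matches:
        diff = 0  # home goals minus away goals
        last = 0
        for minute, side in sorted(events):
            minute = min(minute, match_minutes)
            # Both teams accrue exposure in their own state since the last event.
            for team_diff in (diff, -diff):
                exposure[label(team_diff)] += minute - last
            # Credit the goal to the state the scoring team was in beforehand.
            goals[label(diff if side == "home" else -diff)] += 1
            diff += 1 if side == "home" else -1
            last = minute
        # Exposure from the last goal to the final whistle.
        for team_diff in (diff, -diff):
            exposure[label(team_diff)] += match_minutes - last
    return {s: (goals[s] / exposure[s], goals[s], exposure[s])
            for s in exposure if exposure[s] > 0}


# Made-up toy input, two matches, for illustration only.
toy = [[(30, "home")], [(10, "home"), (70, "away")]]
for state, (r, g, m) in sorted(state_rates(toy).items()):
    print(f"{state:9s} goals={g} team_minutes={m:.0f} rate={r:.4f}")
```

The team-minute totals double as the sample-size answer the probe is meant to give: they show how much leading and trailing exposure the usable corpus actually contains.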

Acceptance gate (falsifiable, same convention as every shipped model)

Reuse the repo's standard conjunction gate (scripts/metrics.py apply_conjunction_gate): a candidate clears iff, across the 8-fold walk-forward backtest,

  • median Brier is strictly lower than the stationary Dixon-Coles baseline (by more than the ~0.5bp distinguishable-from-noise floor), AND
  • median ECE is within +0.2pp of baseline,

evaluated on both raw and isotonic-calibrated metrics, and monotone across the (γ_lead, γ_chase) grid (a lone interior peak with degraded neighbours signals overfitting and fails, per the GK-offset and state-space precedents).
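
The two-condition conjunction can be sketched as below. This is an illustration of the rule as stated in this note, not the repo's `apply_conjunction_gate`, whose signature is not shown here. The function name `clears_gate`, the per-fold list inputs, the comparison of medians rather than per-fold differences, and the unit conventions (1bp = 1e-4 Brier, 1pp = 0.01 ECE) are all assumptions. The raw vs isotonic pass and the grid-monotonicity check are left to the caller.

```python
from statistics import median

BP = 1e-4  # one basis point of Brier score (assumed unit convention)
PP = 1e-2  # one percentage point of ECE (assumed unit convention)


def clears_gate(cand_brier, base_brier, cand_ece, base_ece,
                brier_floor_bp=0.5, ece_slack_pp=0.2):
    """Return True iff both conditions hold across the walk-forward folds.

    Each argument is a list of per-fold metric values (8 folds in this note).
    """
    brier_gain = median(base_brier) - median(cand_brier)  # positive is better
    ece_drift = median(cand_ece) - median(base_ece)       # positive is worse
    return brier_gain > brier_floor_bp * BP and ece_drift <= ece_slack_pp * PP


# Made-up fold values, for illustration only.
base_b, cand_b = [0.2000] * 8, [0.1998] * 8
base_e, cand_e = [0.020] * 8, [0.021] * 8
print(clears_gate(cand_b, base_b, cand_e, base_e))
```

A caller would run this once on raw metrics and once on isotonic-calibrated metrics, and require both to pass.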

A secondary, more sensitive read: because the hypothesised effect concentrates on draws and narrow margins, also report per-outcome calibration on the draw leg and scoreline-level log-loss on the top-N exact scorelines — the headline 3-way Brier may be too blunt to register a real but narrow improvement, and we should not let a blunt metric hide a true signal (nor let a narrow metric manufacture one; the conjunction gate above remains the ship/no-ship decision).
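
The two secondary reads could be computed along these lines. Both functions are hypothetical sketches. The equal-width binning is an assumption. The "top-N" log-loss follows one possible reading of the phrase: collapse each predicted grid to its N most likely scorelines plus a single "other" bucket, then score the realised category.

```python
import math


def binned_ece(probs, hits, n_bins=10):
    """Expected calibration error for one outcome leg (e.g. the draw).

    probs: predicted probabilities for that leg; hits: 1 if it occurred, else 0.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, hits):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            freq = sum(y for _, y in members) / len(members)
            ece += len(members) / total * abs(mean_p - freq)
    return ece


def top_n_log_loss(grid, actual, n=5, eps=1e-12):
    """Log-loss of one match on its top-N scorelines plus an 'other' bucket.

    grid: {(home_goals, away_goals): probability}; actual: realised scoreline.
    """
    top = sorted(grid, key=grid.get, reverse=True)[:n]
    if actual in top:
        p = grid[actual]
    else:
        p = 1.0 - sum(grid[s] for s in top)  # mass left in the 'other' bucket
    return -math.log(max(p, eps))


# Made-up numbers, for illustration only.
print(binned_ece([0.3, 0.3, 0.3, 0.3], [1, 0, 0, 0]))
print(top_n_log_loss({(1, 1): 0.5, (1, 0): 0.3, (0, 0): 0.2}, (1, 1), n=2))
```

Averaging `top_n_log_loss` over matches gives a candidate-versus-baseline comparison at scoreline level. As stated above, neither read replaces the conjunction gate as the ship/no-ship decision.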

What a result would mean either way

  • Clears the gate: we'd have the first in-match-dynamics signal in the stack, and a principled way to publish a sharper draw / narrow-margin probability — directly the dimension the UCL final exposed.
  • Fails the gate (the base-rate-likely outcome): a documented negative result joining state-space and style-matchup, with a precise cause (effect too small at this corpus size, or absorbed by the team-level fit). That is genuinely useful — it closes a direction readers and we ourselves will otherwise keep proposing, and it sharpens the public honesty about why the model can't see game scripts (it is a data problem, not an oversight).

Either way the artifact is research-coded calibration work: probabilities and methodology only, no odds, no market comparison, no betting-action framing (COMPLIANCE.md §3).

Decision

No build yet. This note is the proposal. The recommended next step is the cheap data-feasibility probe (does the StatsBomb timing corpus show a measurable, estimable game-state effect, and on how many matches), gated before committing to the model fit — the state-space note's main retrospective regret was building the harness before quantifying the sparsity prior. If the probe shows the effect is present and the usable corpus is larger than the xG-response corpus that already failed, proceed to the two-parameter global fit and the gate above. If not, record the negative feasibility result and stop.