Research note

Within-match game-state: the corpus is ample, the confounding is the problem

Status: Feasibility probe complete. The design note's stated risk was wrong; a different risk is binding. Qualified result: read the decision gate at the bottom.
Probe date: 2026-06-03
Reproducer: .venv/bin/python scripts/analysis/gamestate_feasibility.py (network, first run caches ~314 matches), then --no-network to re-aggregate offline.
Persisted output: per-match goal timelines under data/raw/statsbomb_gamestate/ and the result JSON via --json-out (both gitignored, regenerable per host).
Design note: documentation/research-notes/within-match-gamestate-design.md

The design note proposed a within-match game-state layer, in which a leading team suppresses its own scoring (lead protection) and a trailing team raises it (the chase), and named corpus sparsity as the binding risk, by analogy to the xG-as-response gate that failed on ~143 matches. This probe ran the cheap feasibility check the note gated on, before any model is built. It overturns the note's risk assessment in both directions: the corpus is about 2.2× the size of the xG-response corpus and is not the constraint, while a risk the note did not flag (game state is endogenous to team strength) turns out to be binding. After controlling for strength, only one of the two hypothesised legs survives.

What the probe measures

For all 314 matches in the six StatsBomb open-data tournaments (WC 2018 / 2022, Euro 2020 / 2024, Copa América 2024, AFCON 2023) the probe pulls every goal with its match-minute and scoring team, reconstructs the running scoreline, and treats the match as an inhomogeneous Poisson process. Between consecutive goals the score state is constant, so each segment contributes its duration as team-minutes of exposure in that state for each team, and the goal that ends a segment is one event credited to the scoring team's state at that instant. The empirical scoring rate in a state is then goals-in-state divided by team-minutes-in-state: a direct, model-free read of the lead/chase prior. All 314 matches were reconstructed exactly to their recorded final score (0 dropped on the integrity check), for 789 goals in total. Shoot-outs (period 5) are excluded.
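
The segment accounting described above can be sketched as follows. This is a minimal illustration, not the probe script itself: the function names, the `(minute, team)` goal tuples and the `match_end` argument are all assumptions made for the example.

```python
from collections import defaultdict

def state_of(diff):
    """Map a goal difference (own minus opponent) to a game state."""
    if diff > 0:
        return "leading"
    if diff < 0:
        return "trailing"
    return "level"

def accumulate(goals, match_end, exposure, counts):
    """Add one match to the running totals.

    goals: list of (minute, team), team in {"home", "away"}, sorted by minute.
    match_end: total minutes played (hypothetical argument, includes stoppage time).
    """
    diff = 0    # home score minus away score
    prev = 0.0  # start minute of the current constant-score segment
    for minute, team in goals + [(match_end, None)]:
        duration = minute - prev
        # Both teams accrue exposure for the whole segment, each in its own state.
        exposure[state_of(diff)] += duration
        exposure[state_of(-diff)] += duration
        if team is not None:
            # Credit the goal to the scorer's state just before it is scored.
            own_diff = diff if team == "home" else -diff
            counts[state_of(own_diff)] += 1
            diff += 1 if team == "home" else -1
        prev = minute
    return diff

exposure = defaultdict(float)
counts = defaultdict(int)
# Toy match: home scores at 20', away equalises at 50', home scores at 80'; 95 minutes played.
final_diff = accumulate([(20, "home"), (50, "away"), (80, "home")], 95, exposure, counts)
```

The returned goal difference gives the integrity check for free: it should equal the recorded final score. Because every segment adds the same duration to one team's state and to the mirrored state of its opponent, leading and trailing exposure always come out equal.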

Result 1 — the corpus is not the constraint

The design note expected the StatsBomb timing corpus to be the failure mode, the way ~143 xG matches sank xG-as-response. It isn't close:

| state | team-minutes of exposure | goals |
| --- | --- | --- |
| leading | 14,515 | 220 |
| level | 32,731 | 374 |
| trailing | 14,515 | 195 |

314 matches is already ~2.2× the xG-response corpus, and even the off-level states — which only accrue exposure after a goal — carry ~14.5k team-minutes and ~200 goals each. A two-parameter global effect has far more than enough data here. The sparsity precedent does not carry over. (Leading and trailing exposures are identical by construction — when one team leads by d the other trails by d — which is a useful check that the segment accounting is symmetric.)

Result 2 — the marginal signal is real, large, and the wrong sign

Goals per team per 90 minutes, by the team's own game state:

| state | rate / 90 | rate ratio vs level | 95% CI |
| --- | --- | --- | --- |
| leading | 1.364 | 1.33 | [1.12, 1.57] (excludes 1) |
| level | 1.028 | (reference) | |
| trailing | 1.209 | 1.18 | [0.99, 1.40] |
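
The per-90 rates and rate ratios follow directly from the exposure and goal counts reported under Result 1. The sketch below recomputes them; the interval is a log-scale Wald interval for a ratio of two Poisson rates, which is an assumption here, since the note does not say which interval the probe script uses.

```python
import math

# Exposure (team-minutes) and goals per state, copied from the Result 1 table.
table = {
    "leading":  (14_515, 220),
    "level":    (32_731, 374),
    "trailing": (14_515, 195),
}

def rate_per_90(minutes, goals):
    """Goals per team per 90 minutes of exposure."""
    return 90.0 * goals / minutes

def ratio_vs_level(state, z=1.96):
    """Rate ratio against the level state, with an assumed log-scale Wald interval."""
    minutes, goals = table[state]
    ref_minutes, ref_goals = table["level"]
    ratio = rate_per_90(minutes, goals) / rate_per_90(ref_minutes, ref_goals)
    # Standard error of log(ratio) when both counts are treated as Poisson.
    se = math.sqrt(1.0 / goals + 1.0 / ref_goals)
    return ratio, ratio * math.exp(-z * se), ratio * math.exp(z * se)

for state in ("leading", "trailing"):
    ratio, lo, hi = ratio_vs_level(state)
    rate = rate_per_90(*table[state])
    print(f"{state}: {rate:.3f} per 90, ratio {ratio:.2f} [{lo:.2f}, {hi:.2f}]")
```

Under that assumption the output agrees with the table to the precision shown, which suggests, without confirming, that the probe uses a comparable interval.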

Read naively, these rates say leading teams score 33% more, not less: a significant effect, and the opposite of the lead-protection prior. This is the trap the probe exists to catch. Score state is not assigned at random: the team that is leading is disproportionately the stronger team (it leads because it is better and scores more), and matches that reach a lead are the open, end-to-end ones. A naive global two-parameter fit on raw goal events would estimate γ_lead > 0. That estimate would be picking up team strength, which is already inside λ_base, and so counting it twice. On raw rates the model is mis-identified before it starts.
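
The selection mechanism can be illustrated with a purely synthetic simulation. Everything in the sketch below is invented for illustration and none of it is the probe's data: each team has a fixed scoring rate (hypothetically 0.6 or 2.0 goals per 90) that never responds to the scoreline, yet the marginal rate in the leading state still comes out higher.

```python
import random

def simulate(n_matches=3000, seed=0):
    """Synthetic matches with fixed team rates and no game-state effect at all."""
    rng = random.Random(seed)
    minutes = {"leading": 0, "level": 0, "trailing": 0}
    goals = {"leading": 0, "level": 0, "trailing": 0}

    def state(diff):
        return "leading" if diff > 0 else "trailing" if diff < 0 else "level"

    for _ in range(n_matches):
        # Hypothetical strengths: each team is weak (0.6) or strong (2.0) goals per 90.
        rates = [rng.choice([0.6, 2.0]), rng.choice([0.6, 2.0])]
        diff = 0  # team 0 score minus team 1 score
        for _minute in range(90):
            # States are fixed at the start of the minute for both teams.
            states = [state(diff), state(-diff)]
            for team in (0, 1):
                minutes[states[team]] += 1
            for team in (0, 1):
                # Scoring probability depends only on the team's own fixed rate.
                if rng.random() < rates[team] / 90.0:
                    goals[states[team]] += 1
                    diff += 1 if team == 0 else -1
    # Marginal goals per 90 in each state, pooled over all teams.
    return {s: 90.0 * goals[s] / minutes[s] for s in minutes}

marginal = simulate()
```

In this toy setup the leading-state rate exceeds both the level and trailing rates, purely because strong teams contribute most of the leading exposure. It shows only that the mechanism is sufficient to produce the sign seen in the raw rates, not how much of the observed 33% it explains.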

Result 3 — controlling for strength, only the chase leg survives

The fix mirrors the model's own λ(t,d) = λ_base · exp(…) structure: measure each team's deviation from its own baseline scoring level in each state (a team fixed effect). Observed-over-expected, where expected = team baseline rate × exposure-in-state:

| state | O/E | 95% CI | reading |
| --- | --- | --- | --- |
| leading | 1.05 | [0.92, 1.20] | not distinguishable from baseline |
| level | 0.90 | [0.82, 1.00] | slightly below baseline |
| trailing | 1.18 | [1.02, 1.36] | above baseline, significant |

Once strength is netted out:

  • Lead protection (the leader suppresses its own scoring) is not supported. O/E 1.05, CI spans 1. The entire marginal "leaders score 33% more" was strength selection; the within-team residual is flat. The γ_lead leg of the proposed model should be expected ≈ 0.
  • The chase effect is supported. Trailing teams score ~18% above their own baseline, and the CI excludes 1. This is a genuine within-match dynamic, not a strength artefact.
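
The observed-over-expected calculation can be sketched as below. It is a simplified stand-in, not the probe's estimator: the baseline here is each team's overall goals per minute across all states, the interval treats the observed count as Poisson on the log scale, and the toy rows are invented. All of these are assumptions.

```python
import math
from collections import defaultdict

def observed_over_expected(rows, z=1.96):
    """rows: list of (team, state, minutes, goals), one row per team and state.

    Baseline = the team's overall goals per minute across all states, used as a
    simple stand-in for a team fixed effect (assumed, not the probe's estimator).
    """
    team_minutes = defaultdict(float)
    team_goals = defaultdict(float)
    for team, _state, minutes, goals in rows:
        team_minutes[team] += minutes
        team_goals[team] += goals

    observed = defaultdict(float)
    expected = defaultdict(float)
    for team, state, minutes, goals in rows:
        baseline = team_goals[team] / team_minutes[team]
        observed[state] += goals
        expected[state] += baseline * minutes  # expected = baseline rate x exposure

    result = {}
    for state in observed:
        oe = observed[state] / expected[state]
        # Log-scale interval treating the observed count as Poisson (assumed method).
        half = z / math.sqrt(observed[state])
        result[state] = (oe, oe * math.exp(-half), oe * math.exp(half))
    return result

# Toy data, invented for illustration: a strong team and a weak team.
rows = [
    ("strong", "leading", 600, 14), ("strong", "level", 300, 7), ("strong", "trailing", 100, 3),
    ("weak", "leading", 100, 1), ("weak", "level", 300, 2), ("weak", "trailing", 600, 6),
]
oe = observed_over_expected(rows)
```

In the toy rows the pooled leading rate is well above the pooled trailing rate (15 goals against 9 on equal exposure), yet the leading O/E is just under 1 and the trailing O/E is above 1, because most leading minutes belong to the strong team. Applying the same interval to the note's own figures, for example O/E 1.05 on 220 goals, gives roughly [0.92, 1.20], consistent with the table, though that does not confirm the probe's exact method.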

One residual confound cuts in our favour: the team fixed effect does not control for opponent strength, and teams tend to lead against weaker defences. That would, if anything, inflate a leader's apparent rate and understate a real chase — so the null leader leg is conservative and the surviving chase leg is robust. (The classic "park the bus" intuition isn't contradicted: lead protection in football shows up mostly as the trailing opponent attacking more, which is exactly the chase leg, not as the leader scoring less.)

The minute-bucketed cut is reported in the JSON but not used for the verdict: the 90'+ bucket shows inflated rates in every state because stoppage-time and extra-time goal density is measured against a thin sliver of exposure — a bucketing artefact, not a time-intensification signal.

Decision

The design note's branch was: proceed to the two-parameter global fit if the effect is present and the corpus beats the xG-response corpus; otherwise record the negative result and stop. The probe lands between those branches and revises the next step rather than taking either:

  1. Do not run the two-parameter global fit on raw goal rates. It is mis-identified — it would learn γ_lead > 0 from strength selection and most likely degrade calibration by double-counting strength already in λ_base.
  2. If a fit is attempted, it must be a strength-residualised, effectively one-parameter (chase-only) model, identifying the multiplier per match against each team's λ_base, with the lead leg pinned near zero. The realistic ceiling is the chase leg alone — roughly half the two-legged story the design note hypothesised.
  3. Expect the base-rate outcome. An ~18% chase multiplier, concentrated in limited late-game exposure and partly absorbed by the team-level fit, is squarely in the territory where state-space (+31bp, no ship) and style-matchup (flat) landed. The honest prior for the gate is no-ship.

The cheap probe did its job: it converted the design note's framing ("corpus is the only risk, proceed if it's big enough") into a sharper and more accurate one — the corpus is fine; identification is the risk; the effect is one-legged and modest. That is a better place to decide from, and it is the kind of result the design note anticipated either way: a documented limit on what in-match dynamics our data can cleanly support, stated as probabilities and rates only — no odds, no market comparison, no betting-action framing (COMPLIANCE.md §3).