Status: Feasibility probe complete. The design note's stated risk was wrong; a different risk is binding. Qualified result — read the decision gate at the bottom.
Probe date: 2026-06-03
Reproducer: .venv/bin/python scripts/analysis/gamestate_feasibility.py (needs network; the first run caches all 314 matches), then rerun with --no-network to re-aggregate offline.
Persisted output: per-match goal timelines under data/raw/statsbomb_gamestate/ and the result JSON via --json-out (both gitignored, regenerable per host).
Design note: documentation/research-notes/within-match-gamestate-design.md
The design note proposed a within-match game-state layer, in which a leading team suppresses its own scoring (lead protection) and a trailing team raises it (the chase), and named corpus sparsity as the binding risk, by analogy to the xG-as-response gate that failed on ~143 matches. This probe ran the cheap feasibility check the note gated on, before any model is built. It overturns the note's risk assessment in both directions: the corpus is ~2.2× the size of the one that sank xG-as-response and is not the constraint, while a risk the note did not flag (game-state is endogenous to team strength) turns out to be binding. After controlling for strength, only one of the two hypothesised legs survives.
What the probe measures
For all 314 matches in the six StatsBomb open-data tournaments (WC 2018 /
2022, Euro 2020 / 2024, Copa América 2024, AFCON 2023) it pulls every goal
with its match-minute and scoring team, reconstructs the running scoreline,
and treats the match as an inhomogeneous Poisson process. Between
consecutive goals the score state is constant, so each segment contributes
its duration, in team-minutes, to each team's exposure in its current state,
and the goal that ends a segment is one event credited to the scoring team's
state at that instant.
The empirical scoring rate in a state is then goals-in-state divided by
team-minutes-in-state — a direct, model-free read of the lead/chase prior.
All 314 matches reconstructed to their recorded final score exactly (0
dropped on the integrity check); 789 goals total. Shoot-outs (period 5) are
excluded.
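The segment accounting described above can be sketched as follows. This is a minimal illustration and not the code in scripts/analysis/gamestate_feasibility.py: the function names, the `(minute, team)` tuple shape, and the `end_minute` argument are all assumptions made for the example.

```python
from collections import defaultdict

def state_of(diff):
    """Map a goal difference, seen from one team's side, to a state label."""
    if diff > 0:
        return "leading"
    if diff < 0:
        return "trailing"
    return "level"

def accumulate_match(goals, end_minute, exposure, scored):
    """Add one match's team-minutes and goals to the per-state totals.

    goals: list of (minute, team), team in {"home", "away"}, sorted by minute.
    end_minute: total minutes played (hypothetical input for this sketch).
    Returns the final goal difference (home minus away) for an integrity check.
    """
    diff = 0      # home goals minus away goals
    last = 0.0    # start of the current constant-score segment
    for minute, team in goals:
        duration = minute - last
        # Both teams accrue the segment's duration, each in its own state.
        exposure[state_of(diff)] += duration
        exposure[state_of(-diff)] += duration
        # The goal is credited to the scorer's state at the instant it is scored.
        scorer_diff = diff if team == "home" else -diff
        scored[state_of(scorer_diff)] += 1
        diff += 1 if team == "home" else -1
        last = minute
    # Final segment, from the last goal to the end of play.
    duration = end_minute - last
    exposure[state_of(diff)] += duration
    exposure[state_of(-diff)] += duration
    return diff

# Invented example: home scores at 10', away at 55' and 80', 90 minutes played.
exposure = defaultdict(float)
scored = defaultdict(int)
final_diff = accumulate_match(
    [(10, "home"), (55, "away"), (80, "away")], 90, exposure, scored
)
```

In this invented match the leading and trailing exposures come out equal (55 team-minutes each), and the total exposure is twice the match length, which is the same symmetry the real aggregate is expected to show.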
Result 1 — the corpus is not the constraint
The design note expected the StatsBomb timing corpus to be the failure mode, the way ~143 xG matches sank xG-as-response. It isn't close:
| state | team-minutes of exposure | goals |
|---|---|---|
| leading | 14,515 | 220 |
| level | 32,731 | 374 |
| trailing | 14,515 | 195 |
314 matches is already ~2.2× the xG-response corpus, and even the off-level states — which only accrue exposure after a goal — carry ~14.5k team-minutes and ~200 goals each. A two-parameter global effect has far more than enough data here. The sparsity precedent does not carry over. (Leading and trailing exposures are identical by construction — when one team leads by d the other trails by d — which is a useful check that the segment accounting is symmetric.)
Result 2 — the marginal signal is real, large, and the wrong sign
Goals per team per 90 minutes, by the team's own game state:
| state | rate / 90 | rate ratio vs level | 95% CI |
|---|---|---|---|
| leading | 1.364 | 1.33 | [1.12, 1.57] — excludes 1 |
| level | 1.028 | — | — |
| trailing | 1.209 | 1.18 | [0.99, 1.40] |
Read naively this says leading teams score 33% more, not less —
significant, and the opposite of the lead-protection prior. This is the
trap the probe exists to catch. Score state is not assigned at random: the
team that is leading is disproportionately the stronger team (it leads
because it is better and scores more), and matches that reach a lead are
the open, end-to-end ones. A naive global two-parameter fit on raw goal
events would estimate γ_lead > 0 — learning team strength, which is
already inside λ_base, and double-counting it. On raw rates the model is
mis-identified before it starts.
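The marginal rates and ratios in the table can be recomputed from the Result 1 counts. The sketch below assumes a Wald interval on the log rate ratio for Poisson counts; the note does not say how the script derives its intervals, so treat the interval method as an assumption. Under that assumption the output agrees with the tabulated values to two decimals.

```python
import math

# (team-minutes of exposure, goals) per state, copied from the Result 1 table.
STATES = {
    "leading": (14515, 220),
    "level": (32731, 374),
    "trailing": (14515, 195),
}

def rate_per_90(minutes, goals):
    """Goals per team per 90 minutes of exposure."""
    return goals / minutes * 90

def rate_ratio_ci(state, ref="level", z=1.96):
    """Rate ratio vs the reference state, with an assumed log-scale Wald CI."""
    m1, g1 = STATES[state]
    m0, g0 = STATES[ref]
    ratio = (g1 / m1) / (g0 / m0)
    # Standard error of log(rate ratio) for two independent Poisson counts.
    se = math.sqrt(1 / g1 + 1 / g0)
    return ratio, ratio * math.exp(-z * se), ratio * math.exp(z * se)

for state in ("leading", "trailing"):
    ratio, lo, hi = rate_ratio_ci(state)
    rate = rate_per_90(*STATES[state])
    print(f"{state}: {rate:.3f} per 90, ratio {ratio:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```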
Result 3 — controlling for strength, only the chase leg survives
The fix mirrors the model's own λ(t,d) = λ_base · exp(…) structure:
measure each team's deviation from its own baseline scoring level in each
state (a team fixed effect). Observed-over-expected, where expected =
team baseline rate × exposure-in-state:
| state | O/E | 95% CI | reading |
|---|---|---|---|
| leading | 1.05 | [0.92, 1.20] | not distinguishable from baseline |
| level | 0.90 | [0.82, 1.00] | slightly below baseline |
| trailing | 1.18 | [1.02, 1.36] | above baseline — significant |
Once strength is netted out:
- Lead protection (the leader suppresses its own scoring) is not
supported. O/E 1.05, CI spans 1. The entire marginal "leaders score
33% more" was strength selection; the within-team residual is flat. The
γ_lead leg of the proposed model should be expected ≈ 0.
- The chase effect is supported. Trailing teams score ~18% above their
own baseline, and the CI excludes 1. This is a genuine within-match
dynamic, not a strength artefact.
One residual confound cuts in our favour: the team fixed effect does not control for opponent strength, and teams tend to lead against weaker defences. That would, if anything, inflate a leader's apparent rate and understate a real chase — so the null leader leg is conservative and the surviving chase leg is robust. (The classic "park the bus" intuition isn't contradicted: lead protection in football shows up mostly as the trailing opponent attacking more, which is exactly the chase leg, not as the leader scoring less.)
The minute-bucketed cut is reported in the JSON but not used for the verdict: the 90'+ bucket shows inflated rates in every state because stoppage-time and extra-time goal density is measured against a thin sliver of exposure — a bucketing artefact, not a time-intensification signal.
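The strength-selection trap and the observed-over-expected correction in this section can be illustrated on toy data. Every number below is invented, and the baseline definition (a team's goals pooled over all states, divided by its total exposure) is an assumption about how "team baseline rate" is computed, not a description of the script. Each toy team scores at a constant rate in every state, so there is no true game-state effect to find.

```python
from collections import defaultdict

# Toy rows: (team, state, team-minutes, goals). All values are invented.
# "strong" scores 0.02 goals/min and "weak" 0.01 goals/min in every state.
ROWS = [
    ("strong", "leading", 600, 12), ("strong", "level", 300, 6),
    ("strong", "trailing", 100, 2),
    ("weak", "leading", 100, 1), ("weak", "level", 300, 3),
    ("weak", "trailing", 600, 6),
]

def marginal_rates(rows):
    """Pooled goals per team-minute in each state, ignoring team identity."""
    minutes, goals = defaultdict(float), defaultdict(float)
    for _team, state, m, g in rows:
        minutes[state] += m
        goals[state] += g
    return {s: goals[s] / minutes[s] for s in minutes}

def observed_over_expected(rows):
    """O/E per state, with expected = team baseline rate x exposure in state."""
    team_minutes, team_goals = defaultdict(float), defaultdict(float)
    for team, _state, m, g in rows:
        team_minutes[team] += m
        team_goals[team] += g
    observed, expected = defaultdict(float), defaultdict(float)
    for team, state, m, g in rows:
        # Assumed baseline: the team's own rate pooled over all states.
        baseline = team_goals[team] / team_minutes[team]
        observed[state] += g
        expected[state] += baseline * m
    return {s: observed[s] / expected[s] for s in observed}

marginal = marginal_rates(ROWS)
oe = observed_over_expected(ROWS)
print("marginal leading/level:", round(marginal["leading"] / marginal["level"], 2))
print("O/E:", {s: round(v, 2) for s, v in oe.items()})
```

On this toy input the marginal leading-to-level ratio is about 1.24 even though neither team changes its scoring with the score, while O/E is 1.0 in every state. The marginal gap comes entirely from the stronger team spending more of its exposure in the lead.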
Decision
The design note's branch was: proceed to the two-parameter global fit if the effect is present and the corpus beats the xG-response corpus; otherwise record the negative result and stop. The probe lands between those branches and revises the next step rather than taking either:
- Do not run the two-parameter global fit on raw goal rates. It is
mis-identified: it would learn γ_lead > 0 from strength selection and
most likely degrade calibration by double-counting strength already in
λ_base.
- If a fit is attempted, it must be a strength-residualised, effectively
one-parameter (chase-only) model, identifying the multiplier per match
against each team's λ_base, with the lead leg pinned near zero. The
realistic ceiling is the chase leg alone, roughly half the two-legged
story the design note hypothesised.
- Expect the base-rate outcome. An ~18% chase multiplier, concentrated in
limited late-game exposure and partly absorbed by the team-level fit, is
squarely in the territory where state-space (+31bp, no ship) and
style-matchup (flat) landed. The honest prior for the gate is no-ship.
The cheap probe did its job: it converted the design note's framing ("corpus is the only risk, proceed if it's big enough") into a sharper and more accurate one — the corpus is fine; identification is the risk; the effect is one-legged and modest. That is a better place to decide from, and it is the kind of result the design note anticipated either way: a documented limit on what in-match dynamics our data can cleanly support, stated as probabilities and rates only — no odds, no market comparison, no betting-action framing (COMPLIANCE.md §3).