Status: Not shipped. The naive 3-way gate is fooled; controls show why. Decision gate at the bottom.
Backtest date: 2026-06-03
Reproducer: .venv/bin/python scripts/analysis/gamestate_chase_backtest.py --folds 8 --window-years 3 --min-matches 30 --today 2026-05-25 then re-run with --placebo flat and --placebo draw (replays instantly off the cached λ's).
Persisted output: data/raw/gamestate_chase_*.json and per-walk λ caches under data/raw/gamestate_chase_lamcache/ (gitignored, regenerable).
Prior notes: within-match-gamestate-design.md (proposal), within-match-gamestate-feasibility.md (the probe that motivated a chase-only fit).
The feasibility probe found that, after controlling for team strength, only the chase leg of the game-state hypothesis survives — trailing teams score ~18% above their own baseline. This note builds the cheapest faithful model of that leg and runs it through the repo's walk-forward gate. The headline 3-way Brier says ship. Three placebo controls say don't — the "improvement" is a draw-rate calibration correction wearing a game-state costume, not within-match dynamics. The metric was too blunt to tell the difference; the controls are what catch it.
The model and the test
A single global multiplier scales a trailing team's instantaneous
scoring rate, λ(t,d) = λ_base · exp(γ · 1[d<0] · t/90), integrated over
the within-match score path by an exact forward DP (validated to reduce to
independent Poisson at γ=0). The lead leg is pinned at zero — the
feasibility probe showed it isn't there. γ is swept on a grid that includes
0 (no chase) as the baseline, and routed through the same
metrics.apply_conjunction_gate every shipped offset uses.
The baseline is deliberately feasibility-grade: a stationary DC fit on a 3-year window, realised goals, no HP, no tau, no isotonic layer. That is fine for this test because the gate only ever compares γ against γ=0 on identical λ's — the absolute fit quality is not what is under examination.
The headline result (and why it looks like a ship)
8-fold walk-forward, median across folds:
| γ | multiplier | median Brier | ΔBrier | E[goals] |
|---|---|---|---|---|
| 0.00 | 1.00 | 0.56737 | — | 2.555 |
| 0.05 | 1.05 | 0.56687 | −5.0 bp | 2.571 |
| 0.10 | 1.11 | 0.56642 | −9.5 bp | 2.587 |
| 0.166 | 1.18 | 0.56570 | −16.7 bp | 2.609 |
| 0.25 | 1.28 | 0.56486 | −25.1 bp | 2.638 |
| 0.35 | 1.42 | 0.56419 | −31.8 bp | 2.674 |
Monotone improvement, ECE within slack, conjunction gate satisfied: a naive read ships it at γ=0.35. But two things are already wrong. First, the optimum sits at the grid edge and keeps falling — the model wants an even larger chase than 0.35. Second, that γ implies a 1.42× trailing boost, far above the 1.18× the feasibility probe actually measured in the data. A parameter that "wants" a chase effect twice the physical size, running off the edge of its grid, is not estimating the chase effect — it is absorbing something else. (This is the same grid-edge tell that exposed the gk-offset "+44bp" as an extremization artefact.)
The placebos: what it's really doing
Because the expensive per-walk DC λ's are cached, alternative transforms replay on the same fits in milliseconds. Two controls isolate the mechanism.
Control 1 — state-blind goal injection (--placebo flat). Scale both
teams' λ's up by the same per-match factor, injecting the same extra goals
the chase layer adds but with no dependence on the running score. Result:
Brier gets monotonically worse (+2.6 to +20.6 bp). So the gain is not
generic goal inflation — the score-state structure matters. Good: the chase
layer is doing something real to the shape of the outcome distribution.
Control 2 — pure draw-rate nudge (--placebo draw). Take the stationary
probs and move a fraction of each win leg into the draw — the chase layer's
entire 3-way footprint (more draws, fewer wins), but as a blunt marginal
re-balance with no score-path structure at all:
| draw shift | median Brier | ΔBrier |
|---|---|---|
| 0.00 | 0.56737 | — |
| 0.05 | 0.56480 | −25.7 bp |
| 0.10 | 0.56721 | −1.6 bp |
| 0.166 | 0.57772 | +103 bp |
A one-line, structureless "shift ~5% of win probability into draws" recovers −25.7 bp of the chase layer's −31.8 bp — about 80% of the entire headline gain. The chase layer is, on the 3-way metric, an elaborate draw-rate correction. It beats the blunt draw-nudge only because its correction is gentler and self-limiting (the draw-nudge overshoots catastrophically past 5%), which is also why the chase optimum has to run to a γ far above the physical value to deliver the same correction.
The correction is already shipped. The real stationary DC model — the one with the Dixon-Coles tau low-score term that exists precisely to fix draw under-prediction — scores median Brier 0.56585, already better than the chase layer's γ=0 baseline (0.56737) and on par with the best draw-nudge (0.56480). Production then adds an isotonic calibration layer on top. The draw-rate deficit the chase layer "discovers" is one the production stack already closes; on a properly calibrated baseline there is no headroom for it to claim.
Corroboration from the faithful baseline: on 6-year-window fits (a better- calibrated baseline), the per-walk best-γ flips sign across folds — some quarters favour γ=0, others high γ — with sub-3bp swings. A real within-match effect would not depend on the baseline's draw calibration that way; a calibration patch does.
Decision
No ship. The chase layer clears the headline conjunction gate, but the clearance is an artefact: it acts as a draw-rate recalibration of a tau-less, uncalibrated feasibility baseline, ~80% reproducible by a structureless draw nudge, with its optimum stranded at the grid edge at a magnitude well above the physically-measured chase. Against the production baseline — which already corrects draw rate via tau plus isotonic calibration — the incremental skill is gone.
This closes the within-match game-state thread the UCL final opened, and it lands exactly where the feasibility note's base-rate prior said it would ("absorbed by the team-level fit / calibration"), now with a mechanism. It also carries a reusable methodological lesson, the same one the design note flagged: the headline 3-way Brier is too blunt to police a layer whose whole footprint is the draw rate — here it actively rewarded a recalibration masquerading as physics, and only the placebo controls (state-blind goal injection; pure draw nudge) told them apart. Any future "it improves Brier" claim for a draw-touching feature should be made to clear its placebo before its gate.
Research-coded calibration work throughout: probabilities and Brier/ECE only, no odds, no market comparison, no betting-action framing (COMPLIANCE.md §3).