A within-match chase layer "passes" the headline gate — and the placebo proves it shouldn't

Status: Not shipped. The naive 3-way gate is fooled; controls show why. Decision gate at the bottom. Backtest date: 2026-06-03 Reproducer: .venv/bin/python scripts/analysis/gamestate_chase_backtest.py --folds 8 --window-years 3 --min-matches 30 --today 2026-05-25 then re-run with --placebo flat and --placebo draw (replays instantly off the cached λ's). Persisted output: data/raw/gamestate_chase_*.json and per-walk λ caches under data/raw/gamestate_chase_lamcache/ (gitignored, regenerable). Prior notes: within-match-gamestate-design.md (proposal), within-match-gamestate-feasibility.md (the probe that motivated a chase-only fit).

The feasibility probe found that, after controlling for team strength, only the chase leg of the game-state hypothesis survives — trailing teams score ~18% above their own baseline. This note builds the cheapest faithful model of that leg and runs it through the repo's walk-forward gate. The headline 3-way Brier says ship. Three placebo controls say don't — the "improvement" is a draw-rate calibration correction wearing a game-state costume, not within-match dynamics. The metric was too blunt to tell the difference; the controls are what catch it.

The model and the test

A single global multiplier scales a trailing team's instantaneous scoring rate, λ(t,d) = λ_base · exp(γ · 1[d<0] · t/90), integrated over the within-match score path by an exact forward DP (validated to reduce to independent Poisson at γ=0). The lead leg is pinned at zero — the feasibility probe showed it isn't there. γ is swept on a grid that includes 0 (no chase) as the baseline, and routed through the same metrics.apply_conjunction_gate every shipped offset uses.

The baseline is deliberately feasibility-grade: a stationary DC fit on a 3-year window, realised goals, no HP, no tau, no isotonic layer. That is fine for this test because the gate only ever compares γ against γ=0 on identical λ's — the absolute fit quality is not what is under examination.

The headline result (and why it looks like a ship)

8-fold walk-forward, median across folds:

γ	multiplier	median Brier	ΔBrier	E[goals]
0.00	1.00	0.56737	—	2.555
0.05	1.05	0.56687	−5.0 bp	2.571
0.10	1.11	0.56642	−9.5 bp	2.587
0.166	1.18	0.56570	−16.7 bp	2.609
0.25	1.28	0.56486	−25.1 bp	2.638
0.35	1.42	0.56419	−31.8 bp	2.674

Monotone improvement, ECE within slack, conjunction gate satisfied: a naive read ships it at γ=0.35. But two things are already wrong. First, the optimum sits at the grid edge and keeps falling — the model wants an even larger chase than 0.35. Second, that γ implies a 1.42× trailing boost, far above the 1.18× the feasibility probe actually measured in the data. A parameter that "wants" a chase effect twice the physical size, running off the edge of its grid, is not estimating the chase effect — it is absorbing something else. (This is the same grid-edge tell that exposed the gk-offset "+44bp" as an extremization artefact.)

The placebos: what it's really doing

Because the expensive per-walk DC λ's are cached, alternative transforms replay on the same fits in milliseconds. Two controls isolate the mechanism.

Control 1 — state-blind goal injection (--placebo flat). Scale both teams' λ's up by the same per-match factor, injecting the same extra goals the chase layer adds but with no dependence on the running score. Result: Brier gets monotonically worse (+2.6 to +20.6 bp). So the gain is not generic goal inflation — the score-state structure matters. Good: the chase layer is doing something real to the shape of the outcome distribution.

Control 2 — pure draw-rate nudge (--placebo draw). Take the stationary probs and move a fraction of each win leg into the draw — the chase layer's entire 3-way footprint (more draws, fewer wins), but as a blunt marginal re-balance with no score-path structure at all:

draw shift	median Brier	ΔBrier
0.00	0.56737	—
0.05	0.56480	−25.7 bp
0.10	0.56721	−1.6 bp
0.166	0.57772	+103 bp

A one-line, structureless "shift ~5% of win probability into draws" recovers −25.7 bp of the chase layer's −31.8 bp — about 80% of the entire headline gain. The chase layer is, on the 3-way metric, an elaborate draw-rate correction. It beats the blunt draw-nudge only because its correction is gentler and self-limiting (the draw-nudge overshoots catastrophically past 5%), which is also why the chase optimum has to run to a γ far above the physical value to deliver the same correction.

The correction is already shipped. The real stationary DC model — the one with the Dixon-Coles tau low-score term that exists precisely to fix draw under-prediction — scores median Brier 0.56585, already better than the chase layer's γ=0 baseline (0.56737) and on par with the best draw-nudge (0.56480). Production then adds an isotonic calibration layer on top. The draw-rate deficit the chase layer "discovers" is one the production stack already closes; on a properly calibrated baseline there is no headroom for it to claim.

Corroboration from the faithful baseline: on 6-year-window fits (a better- calibrated baseline), the per-walk best-γ flips sign across folds — some quarters favour γ=0, others high γ — with sub-3bp swings. A real within-match effect would not depend on the baseline's draw calibration that way; a calibration patch does.

Decision

No ship. The chase layer clears the headline conjunction gate, but the clearance is an artefact: it acts as a draw-rate recalibration of a tau-less, uncalibrated feasibility baseline, ~80% reproducible by a structureless draw nudge, with its optimum stranded at the grid edge at a magnitude well above the physically-measured chase. Against the production baseline — which already corrects draw rate via tau plus isotonic calibration — the incremental skill is gone.

This closes the within-match game-state thread the UCL final opened, and it lands exactly where the feasibility note's base-rate prior said it would ("absorbed by the team-level fit / calibration"), now with a mechanism. It also carries a reusable methodological lesson, the same one the design note flagged: the headline 3-way Brier is too blunt to police a layer whose whole footprint is the draw rate — here it actively rewarded a recalibration masquerading as physics, and only the placebo controls (state-blind goal injection; pure draw nudge) told them apart. Any future "it improves Brier" claim for a draw-touching feature should be made to clear its placebo before its gate.

Research-coded calibration work throughout: probabilities and Brier/ECE only, no odds, no market comparison, no betting-action framing (COMPLIANCE.md §3).

全文 · 免费

The model and the test

The headline result (and why it looks like a ship)

The placebos: what it's really doing

Decision