Research

Notes & ablations

Decision logs from the model build. Each note documents a hypothesis, the backtest setup, the result, and whether the adjustment shipped. Negative results are kept here too. The decision not to ship is as important as the decision to ship, and a public record protects against revisiting the same ablation by accident.

View only the 11 no-ship notes →

Shipped · 29 June 2026
Neural Poisson: a nonlinear extension of Dixon-Coles
The ensemble's three existing models share a structural constraint:
Not shipped · 3 June 2026
A within-match chase layer "passes" the headline gate — and the placebo proves it shouldn't
The feasibility probe found that, after controlling for team strength, only
Shipped · 31 May 2026
Testing our approach on the Champions League final
The `/test/live/<slug>/` route renders the live-tracker pipeline
Not shipped · 29 May 2026
Is composite *coverage* the lever for the player-strength offset? (No)
player-composite's match coverage — whether honestly (point-in-time WC
Production xG path WIRED (2026-05-29): `auto-refit.yml` fits DC with `--use-xg` and refits the calibrator on the xG-enabled ensemble; HP excluded (fails the gate's ECE half). The xG-enabled artefacts (`dixon_coles.json` / `ensemble_calibrator.json` / `data.json`) regenerate on the first auto-refit after merge (not hand-committed; see "Production wiring" for why). Single-provider corpus (314 StatsBomb + 28 residual Opta Copa-2021 = 342 rows). Gate clears for DC + Ensemble on both evaluation slices · 29 May 2026
Back-filling international xG from StatsBomb open data
The model's `--use-xg` path fits `round(xG)` as the per-match Poisson response in
Not shipped · 29 May 2026
Does a player-form (momentum) offset improve match forecasts? (No)
player-form differential offset `Δ = α·(form_home − form_away)` does
Not shipped · 29 May 2026
Can we fit the player-strength coefficient instead of hand-setting it? (No)
α = 0.05 offset (Model 16) beats a per-fold fitted α on median Brier.
Not shipped · 27 May 2026
Anytime-scorer `start_prob` v2 — predicted-XI layer (default-off)
Model 5 (`scripts/build_anytime_scorer.py`) produces `P(player scores ≥ 1 across the WC tournament)`. The headline depends on `E[minutes]`, which is derived from `start_prob` (the per-match starter likelihood). The v1 chain was:
Not shipped · 27 May 2026
Do teams try harder in must-win games? (No, actually)
Football economics literature (Brams & Ismail 2018; Apesteguia & Palacios-Huerta 2010 on tournament-incentive distortions) reports that match outcomes in the final round of group-stage tournaments deviate from baseline expectations when the
Shipped · 27 May 2026
Hierarchical Poisson — full PyMC NUTS posterior
* Fit posterior: `scripts/fit_hp_posterior.py`
Not shipped · 27 May 2026
Letting team ratings drift over time (didn't improve predictions)
Per the design note (variant a, "EMA on (α_t, β_t)"): each team's attack/defence parameters should EVOLVE through time rather than absorb every era's matches into a single stationary compromise. Refit DC at K snapshot timestamps (= the 8 qu
Shipped · 25 May 2026
Are Premier League players really better? Cross-league strength adjustments
α = 0.05 unchanged.
Shipped · 25 May 2026
Does your starting goalkeeper change your defence? (Yes)
The starting-keeper rating is informative beyond what the team-level
Shipped · 24 May 2026
Predicting goalscorers: breaking down shot volume and shot quality
- `scripts/build_ratings_player.py` (now splices `big5_player_shooting.parquet` into `ratings_player.csv`)
Not shipped · 24 May 2026
Do some playing styles beat others? (Not enough to measure)
- `scripts/build_style_matchup_training.py` (per-match training join)
Not shipped · 23 May 2026
Retuning the models for tournament football — what changed
PR #310 documented that all four models in the ensemble are ~7% worse on tournament matches than on the all-matches average. The natural follow-up is to refit the predict-time knobs on a tournament-only training slice and serve tournament-v
Shipped · 22 May 2026
How well do the models predict tournaments specifically?
> **Update (2026-06-01) — superseded by a leakage-free measurement.** The ~0.545 tournament Brier below comes from a single-fold backtest that composes its ensemble from the *current* Elo snapshot rather than each team's rating as it stood
Not shipped · 21 May 2026
Does extra rest between matches help? (Not measurably)
Sports-science literature reports a measurable effect of recovery time on football performance: better-rested teams score slightly more goals than fatigued ones. The expected magnitude is small but consistent across studies (Mohr et al. 201
Shipped
Calibrating predictions differently for friendlies vs tournaments
- `scripts/fit_ensemble_calibrator.py` (current implementation)
Complete — A1 + B1 run on the 8-walk / 90-day harness (snapshot 2026-05-28). Verdict: **neither ships.** B1 fails its gate outright; A1's gate "passes" but on an extremization artefact, not a keeper-skill signal. See the Results section at the end
Can international-tournament StatsBomb signals beat the club-derived baseline?
PR #525 + PR #532 produced two new per-team signals extracted from StatsBomb open event data across WC 2018/2022, Euro 2020/2024, Copa America 2024, AFCON 2023:
Design only. No code written, no fit run, no decision taken
Can team strength change mid-season? Design for a time-varying model
The shipping Dixon-Coles fit (`scripts/fit_dixon_coles.py`) gives every
Not shipped
Can video highlights produce useful match signal? (Early, promising, not shipped)
The model forecasts match outcomes using pre-match information only: Elo ratings, Dixon-Coles parameters, historical xG, squad composites. Once a tournament starts, the model's view of a team can only update through Elo (which reacts to res
Design only. No code written, no fit run, no decision taken
Can we model the game *script*? Design for a within-match game-state layer
Our forecasting stack — ClubElo / FIFA-Elo + Dixon-Coles + Hierarchical
Post-processing parameter sweep
Walk-forward backtest (8 folds × 90 days, n=1,844 matches) over four
Feasibility probe complete. The design note's stated risk was wrong; a different risk is binding. Qualified result — read the decision gate at the bottom
Within-match game-state: the corpus is ample, the confounding is the problem
The design note proposed a within-match game-state layer — a leading team
