Research
Notes & ablations
Decision logs from the model build. Each note documents a hypothesis, the backtest setup, the result, and whether the adjustment shipped. Negative results are kept here too — the decision not to ship is as important as the decision to ship, and a public record protects against revisiting the same ablation by accident.
- Not shipped · 29 May 2026
Is composite *coverage* the lever for the player-strength offset? (No)
player-composite's match coverage — whether honestly (point-in-time WC
- Production xG path WIRED (2026-05-29): `auto-refit.yml` fits DC with `--use-xg` and refits the calibrator on the xG-enabled ensemble; HP excluded (fails the gate's ECE half). The xG-enabled artefacts (`dixon_coles.json` / `ensemble_calibrator.json` / `data.json`) regenerate on the first auto-refit after merge (not hand-committed; see "Production wiring" for why). Single-provider corpus (314 StatsBomb + 28 residual Opta Copa-2021 = 342 rows). Gate clears for DC + Ensemble on both evaluation slices · 29 May 2026
Back-filling international xG from StatsBomb open data
The model's `--use-xg` path fits `round(xG)` as the per-match Poisson response in
- Not shipped · 29 May 2026
Does a player-form (momentum) offset improve match forecasts? (No)
player-form differential offset `Δ = α·(form_home − form_away)` does
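As a rough illustration of the offset formula above, the sketch below applies `Δ = α·(form_home − form_away)` to a pair of Poisson log-rates. It assumes the offset is added to the home log-rate and subtracted from the away log-rate; the excerpt does not say how the offset enters the model, so that symmetric treatment, the function name `apply_form_offset`, and all input values are hypothetical.

```python
import math


def apply_form_offset(log_rate_home, log_rate_away, form_home, form_away, alpha):
    """Shift two Poisson log-rates by a form differential.

    ASSUMPTION: the offset is applied symmetrically (+delta home, -delta away).
    The note only gives the formula for delta, not where it is applied.
    """
    delta = alpha * (form_home - form_away)
    rate_home = math.exp(log_rate_home + delta)
    rate_away = math.exp(log_rate_away - delta)
    return rate_home, rate_away


# Hypothetical inputs: baseline expected goals of 1.5 and 1.0, with the
# home side carrying one unit more "form" than the away side.
home, away = apply_form_offset(math.log(1.5), math.log(1.0), 1.0, 0.0, alpha=0.1)
print(f"home rate {home:.3f}, away rate {away:.3f}")
```

With `alpha = 0` or equal form values the rates are unchanged, which is the baseline any such offset is compared against.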
- Not shipped · 29 May 2026
Can we fit the player-strength coefficient instead of hand-setting it? (No)
α = 0.05 offset (Model 16) beats a per-fold fitted α on median Brier.
- Not shipped · 27 May 2026
Anytime-scorer `start_prob` v2 — predicted-XI layer (default-off)
Model 5 (`scripts/build_anytime_scorer.py`) produces `P(player scores ≥ 1 across the WC tournament)`. The headline depends on `E[minutes]`, which is derived from `start_prob` (the per-match starter likelihood). The v1 chain was:
- Not shipped · 27 May 2026
Do teams try harder in must-win games? (No, actually)
Football economics literature (Brams & Ismail 2018; Apesteguia & Palacios-Huerta 2010 on tournament-incentive distortions) reports that match outcomes in the final round of group-stage tournaments deviate from baseline expectations when the
- Shipped · 27 May 2026
Hierarchical Poisson — full PyMC NUTS posterior
* Fit posterior: `scripts/fit_hp_posterior.py`
- Not shipped · 27 May 2026
Letting team ratings drift over time (didn't improve predictions)
Per the design note (variant a, "EMA on (α_t, β_t)"): each team's attack/defence parameters should EVOLVE through time rather than absorb every era's matches into a single stationary compromise. Refit DC at K snapshot timestamps (= the 8 qu
- Shipped · 26 May 2026
Testing our approach on the Champions League final
The `/test/live/<slug>/` route renders the live-tracker pipeline
- Shipped · 25 May 2026
Are Premier League players really better? Cross-league strength adjustments
α = 0.05 unchanged.
- Shipped · 25 May 2026
Does your starting goalkeeper change your defence? (Yes)
The starting-keeper rating is informative beyond what the team-level
- Shipped · 24 May 2026
Predicting goalscorers: breaking down shot volume and shot quality
- `scripts/build_ratings_player.py` (now splices `big5_player_shooting.parquet` into `ratings_player.csv`)
- Not shipped · 24 May 2026
Do some playing styles beat others? (Not enough to measure)
- `scripts/build_style_matchup_training.py` (per-match training join)
- Not shipped · 23 May 2026
Retuning the models for tournament football — what changed
PR #310 documented that all four models in the ensemble are ~7% worse on tournament matches than on the all-matches average. The natural follow-up is to refit the predict-time knobs on a tournament-only training slice and serve tournament-v
- Shipped · 22 May 2026
How well do the models predict tournaments specifically?
The shipping backtest in `documentation/methodology.md` reports Brier / log-loss / ECE on a single-fold 5-year holdout (`2021-05-23 → 2026-05-22`). That holdout is dominated by friendlies — only ~15% of the matches are major-tournament game
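For reference, the three metrics named above can be computed as in the minimal sketch below. It assumes three-way (home / draw / away) probability vectors, the multi-class form of the Brier score, and top-label ECE with 10 equal-width confidence bins. The excerpt does not say which variants the backtest uses, so these choices and the function names are illustrative only.

```python
import math


def brier(probs, outcomes):
    """Multi-class Brier score: mean over matches of sum_k (p_k - y_k)^2."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        total += sum((pk - (1.0 if k == y else 0.0)) ** 2 for k, pk in enumerate(p))
    return total / len(probs)


def log_loss(probs, outcomes, eps=1e-12):
    """Mean negative log-probability assigned to the observed outcome."""
    return -sum(math.log(max(p[y], eps)) for p, y in zip(probs, outcomes)) / len(probs)


def ece(probs, outcomes, n_bins=10):
    """Top-label expected calibration error with equal-width bins (assumed variant)."""
    # Each bin tracks [count, summed confidence, summed correctness].
    bins = [[0, 0.0, 0.0] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        conf = max(p)
        pred = p.index(conf)
        b = min(int(conf * n_bins), n_bins - 1)
        bins[b][0] += 1
        bins[b][1] += conf
        bins[b][2] += 1.0 if pred == y else 0.0
    n = len(probs)
    # Sum of (bin weight) * |accuracy - mean confidence| over non-empty bins.
    return sum(abs(s_conf - s_corr) / n for cnt, s_conf, s_corr in bins if cnt)


# Toy forecasts (made up for illustration): index 0 = home, 1 = draw, 2 = away.
forecasts = [[0.6, 0.25, 0.15], [0.3, 0.3, 0.4], [0.45, 0.3, 0.25]]
results = [0, 2, 1]
print(brier(forecasts, results), log_loss(forecasts, results), ece(forecasts, results))
```

Lower is better for all three; Brier and log-loss reward sharp, correct forecasts, while ECE only measures whether stated confidence matches observed hit rates.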
- Not shipped · 21 May 2026
Does extra rest between matches help? (Not measurably)
Sports-science literature reports a measurable effect of recovery time on football performance: better-rested teams score slightly more goals than fatigued ones. The expected magnitude is small but consistent across studies (Mohr et al. 201
- Shipped
Calibrating predictions differently for friendlies vs tournaments
- `scripts/fit_ensemble_calibrator.py` (current implementation)
- Complete — A1 + B1 run on the 8-walk / 90-day harness (snapshot 2026-05-28). Verdict: **neither ships.** B1 fails its gate outright; A1's gate "passes" but on an extremization artefact, not a keeper-skill signal. See the Results section at the end
Can international-tournament StatsBomb signals beat the club-derived baseline?
PR #525 + PR #532 produced two new per-team signals extracted from StatsBomb open event data across WC 2018/2022, Euro 2020/2024, Copa America 2024, AFCON 2023:
- Design only. No code written, no fit run, no decision taken
Can team strength change mid-season? Design for a time-varying model
The shipping Dixon-Coles fit (`scripts/fit_dixon_coles.py`) gives every