Research
Negative results
These are the model variants and feature additions that were tested against the 8×90-day walk-forward Brier + ECE gate and did not improve on the shipping ensemble. They are published in full because the decision not to ship is the same calibration story as the decision to ship: every entry below records a hypothesis someone could have written, the test that judged it, and the reason the test said no.
9 of 19 notes in the corpus are no-ships. The full notes index, including the variants that did ship, lives at /research/notes/.
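For readers unfamiliar with the gate, here is a minimal sketch of what an 8×90-day walk-forward Brier + ECE check could look like. It is not the project's actual implementation. The function names, the 10-bin top-label ECE, the multiclass Brier form, and the pass rule (median Brier strictly lower, median ECE no higher) are all assumptions made for illustration.

```python
from datetime import date, timedelta
from statistics import median


def brier_score(probs, outcomes):
    """Mean multiclass Brier score: squared error against the one-hot outcome."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        total += sum((pk - (1.0 if k == y else 0.0)) ** 2 for k, pk in enumerate(p))
    return total / len(probs)


def expected_calibration_error(probs, outcomes, n_bins=10):
    """Top-label ECE: bin by confidence, weight |accuracy - confidence| by bin size."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        conf = max(p)
        pred = p.index(conf)
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, 1.0 if pred == y else 0.0))
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            accuracy = sum(hit for _, hit in b) / len(b)
            ece += len(b) / len(probs) * abs(accuracy - avg_conf)
    return ece


def walk_forward_folds(end, n_folds=8, fold_days=90):
    """Contiguous (test_start, test_end) windows, oldest first.

    Assumption: each fold trains only on matches dated before its test_start.
    """
    folds = []
    for i in range(n_folds, 0, -1):
        start = end - timedelta(days=i * fold_days)
        folds.append((start, start + timedelta(days=fold_days)))
    return folds


def passes_gate(candidate, baseline):
    """Hypothetical pass rule over per-fold (brier, ece) pairs."""
    cand_brier = median(b for b, _ in candidate)
    base_brier = median(b for b, _ in baseline)
    cand_ece = median(e for _, e in candidate)
    base_ece = median(e for _, e in baseline)
    return cand_brier < base_brier and cand_ece <= base_ece
```

Under this sketch, a variant is scored once per fold, and the eight `(brier, ece)` pairs for the candidate and for the shipping ensemble are passed to `passes_gate`. A "no-ship" is then any variant for which that comparison returns `False`.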
Why publish the no-ships
- No cherry-picking. If only the variants that improved the gate were published, the shipping ensemble would look more inevitable than it is. The no-ships are evidence of what the corpus and the gate cannot distinguish — they are the negative space around every shipped model change.
- Prevents re-testing by accident. A six-month-old failed ablation is invisible to a new collaborator unless its writeup is discoverable. Keeping negative results on the same surface as positive ones means "did anyone try this?" has an answer that does not require reading the commit log.
- Bounds the model's ceiling. A run of failed capacity-heavy variants on the same corpus is itself a measurement: the gate is hard to beat with the data currently available. That signal is more useful to a reader who can see the failures than to one who only sees the wins.
- Not shipped · 29 May 2026
Is composite *coverage* the lever for the player-strength offset? (No)
player-composite's match coverage — whether honestly (point-in-time WC
Read note →
- Not shipped · 29 May 2026
Does a player-form (momentum) offset improve match forecasts? (No)
player-form differential offset `Δ = α·(form_home − form_away)` does
Read note →
- Not shipped · 29 May 2026
Can we fit the player-strength coefficient instead of hand-setting it? (No)
α = 0.05 offset (Model 16) beats a per-fold fitted α on median Brier.
Read note →
- Not shipped · 27 May 2026
Anytime-scorer `start_prob` v2 — predicted-XI layer (default-off)
Model 5 (`scripts/build_anytime_scorer.py`) produces `P(player scores ≥ 1 across the WC tournament)`. The headline depends on `E[minutes]`, which is derived from `start_prob` (the per-match starter likelihood). The v1 chain was:
Read note →
- Not shipped · 27 May 2026
Do teams try harder in must-win games? (No, actually)
Football economics literature (Brams & Ismail 2018; Apesteguia & Palacios-Huerta 2010 on tournament-incentive distortions) reports that match outcomes in the final round of group-stage tournaments deviate from baseline expectations when the
Read note →
- Not shipped · 27 May 2026
Letting team ratings drift over time (didn't improve predictions)
Per the design note (variant a, "EMA on (α_t, β_t)"): each team's attack/defence parameters should EVOLVE through time rather than absorb every era's matches into a single stationary compromise. Refit DC at K snapshot timestamps (= the 8 qu
Read note →
- Not shipped · 24 May 2026
Do some playing styles beat others? (Not enough to measure)
- `scripts/build_style_matchup_training.py` (per-match training join)
Read note →
- Not shipped · 23 May 2026
Retuning the models for tournament football — what changed
PR #310 documented that all four models in the ensemble are ~7% worse on tournament matches than on the all-matches average. The natural follow-up is to refit the predict-time knobs on a tournament-only training slice and serve tournament-v
Read note →
- Not shipped · 21 May 2026
Does extra rest between matches help? (Not measurably)
Sports-science literature reports a measurable effect of recovery time on football performance: better-rested teams score slightly more goals than fatigued ones. The expected magnitude is small but consistent across studies (Mohr et al. 201
Read note →