Research

Negative results

This page collects the model variants and feature additions that were tested against the 8x90-day walk-forward Brier + ECE gate and did not improve on the shipping ensemble. They are published in full because a decision not to ship rests on the same calibration evidence as a decision to ship: every entry below records a hypothesis someone could reasonably have proposed, the test that judged it, and the reason the test said no.

11 of 25 notes in the corpus are no-ships. The full notes index, including the variants that did ship, lives at /research/notes/.
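
For readers unfamiliar with the gate named above, here is a minimal sketch of what an 8x90-day walk-forward Brier + ECE comparison could look like. It is an illustration under stated assumptions, not the project's actual implementation: it treats forecasts as binary-event probabilities, uses ten equal-width ECE bins, and applies a hypothetical pass rule (lower mean Brier and no worse mean ECE than the baseline). All function names are made up for the example.

```python
from statistics import mean


def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return mean((p - y) ** 2 for p, y in zip(probs, outcomes))


def expected_calibration_error(probs, outcomes, n_bins=10):
    """Equal-width-bin ECE: size-weighted |mean forecast - observed rate| per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_p = mean(p for p, _ in members)
        rate = mean(y for _, y in members)
        ece += len(members) / total * abs(avg_p - rate)
    return ece


def walk_forward_folds(n_days, n_folds=8, fold_days=90):
    """Yield (train_end, test_end) day offsets for the final n_folds windows.

    Each fold trains on days [0, train_end) and tests on [train_end, test_end).
    """
    start = n_days - n_folds * fold_days
    if start <= 0:
        raise ValueError("not enough history for the requested folds")
    for k in range(n_folds):
        train_end = start + k * fold_days
        yield train_end, train_end + fold_days


def passes_gate(baseline_folds, candidate_folds):
    """Hypothetical rule (an assumption): beat mean Brier, do not worsen mean ECE.

    Each argument is a list of (probs, outcomes) pairs, one pair per fold.
    """
    def summarize(folds):
        briers = [brier_score(p, y) for p, y in folds]
        eces = [expected_calibration_error(p, y) for p, y in folds]
        return mean(briers), mean(eces)

    b_brier, b_ece = summarize(baseline_folds)
    c_brier, c_ece = summarize(candidate_folds)
    return c_brier < b_brier and c_ece <= b_ece
```

Scoring both metrics matters because they can disagree: a candidate that sharpens its forecasts can lower Brier while drifting out of calibration, and under the rule assumed here that counts as a no-ship.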

Why publish the no-ships

No cherry-picking. If only the variants that improved the gate were published, the shipping ensemble would look more inevitable than it is. The no-ships are evidence of what the corpus and the gate cannot distinguish -- they are the negative space around every shipped model change.
Prevents re-testing by accident. A six-month-old failed ablation is invisible to a new collaborator unless its writeup is discoverable. Keeping negative results on the same surface as positive ones means "did anyone try this?" has an answer that does not require reading the commit log.
Bounds the model's ceiling. A run of failed capacity-heavy variants on the same corpus is itself a measurement: the gate is hard to beat with the data currently available. That signal is more useful to a reader who can see the failures than to one who only sees the wins.

A within-match chase layer "passes" the headline gate — and the placebo proves it shouldn't

Is composite coverage the lever for the player-strength offset? (No)

Does a player-form (momentum) offset improve match forecasts? (No)

Can we fit the player-strength coefficient instead of hand-setting it? (No)

Anytime-scorer `start_prob` v2 — predicted-XI layer (default-off)

Do teams try harder in must-win games? (No, actually)

Letting team ratings drift over time (didn't improve predictions)

Do some playing styles beat others? (Not enough to measure)

Retuning the models for tournament football — what changed

Does extra rest between matches help? (Not measurably)

Can international-tournament StatsBomb signals beat the club-derived baseline?

Is composite coverage the lever for the player-strength offset? (No)