Research

Negative results

This page lists the model variants and feature additions that were tested against the 8×90-day walk-forward Brier + ECE gate and did not improve on the shipping ensemble. They are published in full because a decision not to ship rests on the same calibration evidence as a decision to ship: every entry below records a hypothesis someone could plausibly have proposed, the test that judged it, and the reason the test said no.

Nine of the 19 notes in the corpus are no-ships. The full notes index, including the variants that did ship, lives at /research/notes/.
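For readers unfamiliar with the two metrics in the gate, the following is a minimal sketch of how a Brier + ECE comparison over walk-forward windows could be computed. It is not the project's actual gate code. Reading "8×90-day" as eight consecutive 90-day evaluation windows is an assumption, as are the ten equal-width ECE bins, the function names, and the pass rule (mean Brier strictly lower, mean ECE no worse).

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def expected_calibration_error(probs, outcomes, n_bins=10):
    """ECE with equal-width bins (bin count is an assumption): the
    size-weighted mean gap between average forecast and observed frequency."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the top bin
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        mean_y = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(mean_p - mean_y)
    return ece


def passes_gate(folds, n_bins=10):
    """Hypothetical pass rule, not the documented one.

    `folds` is a list of (baseline_probs, candidate_probs, outcomes) tuples,
    one per walk-forward window (assumed here to be eight 90-day windows).
    The candidate passes if its mean Brier score is strictly lower than the
    baseline's and its mean ECE is no worse.
    """
    k = len(folds)
    base_brier = sum(brier_score(b, y) for b, _, y in folds) / k
    cand_brier = sum(brier_score(c, y) for _, c, y in folds) / k
    base_ece = sum(expected_calibration_error(b, y, n_bins) for b, _, y in folds) / k
    cand_ece = sum(expected_calibration_error(c, y, n_bins) for _, c, y in folds) / k
    return cand_brier < base_brier and cand_ece <= base_ece
```

Under this assumed rule a variant can lower the Brier score and still be a no-ship if its calibration gets worse, which is one way an apparently better model ends up on this page.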

Why publish the no-ships

  • No cherry-picking. If only the variants that improved the gate were published, the shipping ensemble would look more inevitable than it is. The no-ships show what the corpus and the gate cannot distinguish; they are the negative space around every shipped model change.
  • Prevents re-testing by accident. A six-month-old failed ablation is invisible to a new collaborator unless its writeup is discoverable. Keeping negative results on the same surface as positive ones means "did anyone try this?" has an answer that does not require reading the commit log.
  • Bounds the model's ceiling. A run of failed capacity-heavy variants on the same corpus is itself a measurement: the gate is hard to beat with the data currently available. That signal is more useful to a reader who can see the failures than to one who only sees the wins.