title: "Post-processing parameter sweep: extremization, GK offset, calibrator"
status: shipped
date: 2026-06-10
slug: postprocess-tuning-sweep
Walk-forward backtest (8 folds x 90 days, n=1,844 matches) over four post-processing dimensions applied to the uniform Elo/DC/HP ensemble.
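For reference, the two metrics reported in the tables can be sketched as follows. This assumes the multiclass Brier score summed over the three outcomes (H/D/A), which is consistent with values near 0.5, and a top-label ECE with equal-width bins. The report does not state the exact definitions or the bin count, so the function names and the 10-bin default are illustrative.

```python
from typing import List, Sequence


def brier_multiclass(probs: Sequence[Sequence[float]], outcomes: Sequence[int]) -> float:
    """Mean over matches of the squared error summed over classes (range 0..2)."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        total += sum((p_k - (1.0 if k == y else 0.0)) ** 2 for k, p_k in enumerate(p))
    return total / len(outcomes)


def ece_top_label(probs: Sequence[Sequence[float]], outcomes: Sequence[int],
                  n_bins: int = 10) -> float:
    """Top-label expected calibration error with equal-width confidence bins."""
    bins: List[list] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        p = list(p)
        conf = max(p)                      # confidence of the predicted class
        pred = p.index(conf)
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if pred == y else 0.0))
    n = len(outcomes)
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            accuracy = sum(hit for _, hit in b) / len(b)
            ece += len(b) / n * abs(avg_conf - accuracy)
    return ece
```

Under this definition a uniform 1/3-1/3-1/3 forecast scores 2/3, so the reported values near 0.50 sit between an uninformative forecast and a perfect one (0.0).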
Sweep A: Extremization d
Removing extremization (d=1.0) improved ensemble Brier by 26.8bp relative to production (d=1.15), the single largest gain.
| d | Brier | ECE |
|---|---|---|
| 1.00 | 0.49866 | 0.0336 |
| 1.05 | 0.49920 | 0.0399 |
| 1.10 | 0.50011 | 0.0356 |
| 1.15 (prod) | 0.50134 | 0.0390 |
| 1.20 | 0.50285 | 0.0438 |
| 1.25 | 0.50461 | 0.0521 |
| 1.30 | 0.50658 | 0.0581 |
Brier worsens monotonically as d increases. ECE also trends upward, though not strictly (d=1.05 is worse than d=1.10). The Ranjan-Gneiting (2010) correction assumes the component forecasters are individually well calibrated, but our Elo component uses a fixed 22% draw rate and is poorly calibrated on its own. Extremizing a poorly calibrated component amplifies its miscalibration rather than correcting underconfidence.
Decision: ship d=1.0 (disable extremization).
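A minimal sketch of what the d parameter does, assuming power-law extremization (each probability raised to d, then renormalized). The report does not state the exact transform or whether it is applied per component or to the ensemble output, so the function name and form here are illustrative only.

```python
from typing import List, Sequence


def extremize(probs: Sequence[float], d: float) -> List[float]:
    """Raise each class probability to the power d and renormalize.

    Assumed form: d > 1 sharpens the forecast, d < 1 flattens it,
    and d = 1 leaves it unchanged.
    """
    powered = [p ** d for p in probs]
    z = sum(powered)
    return [p / z for p in powered]


# Hypothetical evenly matched fixture with a fixed 22% draw probability.
even_match = [0.39, 0.22, 0.39]          # P(H), P(D), P(A)
sharpened = extremize(even_match, 1.15)  # production setting before this sweep
unchanged = extremize(even_match, 1.0)   # shipped setting
```

With d=1.0 the transform is the identity, which is why shipping d=1.0 amounts to disabling it. For d>1 the smallest probability shrinks further, so any class that is already under-forecast gets pushed lower still.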
Sweep B: Calibrator on/off
| Config | Brier | ECE |
|---|---|---|
| No calibrator | 0.50134 | 0.0390 |
| With tier_platt | 0.50512 | 0.0462 |
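The production calibrator (Platt temperature) hurts Brier by 37.8bp in walk-forward. The no-calibrator row matches the d=1.15 row of Sweep A, so this comparison appears to have been run at the old production extremization setting. The calibrator was fitted on a 24-month window that includes some of the walk's training data, so the degradation may partly reflect overfitting to that window. The production default (no calibrator) is correct.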
Decision: no change (calibrator already off by default).
Sweep C: GK offset alpha
| alpha | Brier | ECE |
|---|---|---|
| 0.000 | 0.49866 | 0.0336 |
| 0.025 | 0.49870 | 0.0332 |
| 0.050 (prod) | 0.49874 | 0.0336 |
| 0.075 | 0.49879 | 0.0329 |
| 0.100 | 0.49884 | 0.0333 |
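Brier rises with alpha, but only by 0.8bp at the production value (alpha=0.05) and 1.8bp at alpha=0.10. ECE moves by at most 0.0007 across the sweep with no consistent direction. The effect is too small to distinguish real signal from noise, and removing the offset simplifies the pipeline.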
Decision: ship alpha=0.0 (disable GK offset).
Sweep D: Per-class extremization (d_draw separate)
With d_H/A fixed at 1.15, d_draw was varied from 0.90 to 1.20. Every setting scored worse than Sweep A's d=1.00 baseline, so the result is moot given the Sweep A decision.
Decision: no ship (moot).
Elo conversion: ordered logit
Walk-forward evaluation of replacing the fixed 22% draw formula with an ordered logistic regression for Elo-to-P(H/D/A) conversion.
| Scope | Current | Ordered logit | Delta |
|---|---|---|---|
| Elo component only | 0.5332 | 0.5230 | -102bp |
| Full ensemble | 0.49866 | 0.50027 | +16.1bp |
The ordered logit improves the Elo component by 102bp (better draw calibration), but when plugged into the ensemble it hurts Brier by 16.1bp while improving ECE by 3.8pp. We attribute this to an ensemble diversity effect: the better-calibrated Elo component becomes more correlated with DC/HP, which reduces the information gain from averaging. Paradoxically, the original poorly calibrated Elo component contributes more useful signal to the ensemble because its errors are less correlated with those of the Poisson models.
Parameters were stable across the 8 walks: theta_1 ~ -0.685, theta_2 ~ +0.625, beta ~ 0.00513. The generalized variant (5 params) adds only 5bp on the component in exchange for 2 extra params.
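The two conversions can be sketched side by side. The parameter values are the ones reported above; everything else is an assumption: the sign convention (elo_diff is home minus away, cutpoints ordered away < draw < home), whether elo_diff already includes home advantage, and the exact form of the fixed-draw formula are not given in this report.

```python
import math
from typing import Dict


def _sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def ordered_logit_probs(elo_diff: float, theta_1: float = -0.685,
                        theta_2: float = 0.625, beta: float = 0.00513) -> Dict[str, float]:
    """Proportional-odds model: cumulative logits share one slope beta."""
    eta = beta * elo_diff
    p_away = _sigmoid(theta_1 - eta)       # P(outcome <= away)
    p_not_home = _sigmoid(theta_2 - eta)   # P(outcome <= draw)
    return {"H": 1.0 - p_not_home, "D": p_not_home - p_away, "A": p_away}


def fixed_draw_probs(elo_diff: float, draw_rate: float = 0.22) -> Dict[str, float]:
    """Assumed form of the current formula: constant draw mass, with the
    remainder split by the standard Elo expected score."""
    expected = 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))
    return {"H": (1.0 - draw_rate) * expected,
            "D": draw_rate,
            "A": (1.0 - draw_rate) * (1.0 - expected)}
```

Under these assumptions the ordered logit lets the draw probability shrink as the rating gap widens, whereas the fixed formula holds it at 22% for every fixture.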
Decision: no ship (hurts ensemble).
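One way to see how a better component can produce a worse ensemble is the ambiguity decomposition of squared error for a uniform average. This is a standard identity, not something taken from the backtest code, and the function name is illustrative: per match, ensemble error equals the average component error minus the spread of the components around their mean.

```python
from typing import List, Sequence, Tuple


def ambiguity_decomposition(component_probs: Sequence[Sequence[float]],
                            outcome: int) -> Tuple[float, float, float]:
    """Return (ensemble_error, avg_component_error, diversity) for one match.

    For a uniform average and squared (Brier) loss:
        ensemble_error == avg_component_error - diversity
    """
    m = len(component_probs)
    k = len(component_probs[0])
    y = [1.0 if j == outcome else 0.0 for j in range(k)]
    mean: List[float] = [sum(p[j] for p in component_probs) / m for j in range(k)]

    def sq(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - z) ** 2 for x, z in zip(a, b))

    ensemble_error = sq(mean, y)
    avg_component_error = sum(sq(p, y) for p in component_probs) / m
    diversity = sum(sq(p, mean) for p in component_probs) / m
    return ensemble_error, avg_component_error, diversity
```

Swapping in a component that lowers the average component error helps the ensemble only if it does not lower the diversity term by more. Whether that fully accounts for the +16.1bp observed here has not been measured directly in this report.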
Neutral-site / WC-specific adjustments
Analysis of 9,200+ international matches with Elo residuals at neutral venues:
- Home-continent advantage: +5.0pp win probability (n=124, significant)
- Host nation bonus: +8.6pp (n=36, small sample, confounded)
- Mexico City altitude: +9.2pp (confounded with home advantage)
- Travel proximity: +4.6pp (corroborates continent effect)
CONCACAF teams playing in North American WC venues should see a measurable boost. However, implementing this as a pre-match adjustment requires careful gate testing (the effect is real in aggregate but noisy per-match), so this is parked for a follow-up experiment.
Decision: no ship yet (needs gate test as a feature).
DC half-life sweep
The first run produced a null result because of a bug. Python binds default parameter values at function definition time, so assigning fdc.HALF_LIFE_DAYS = hl did not propagate to fdc.fit(), whose signature half_life_days: float = HALF_LIFE_DAYS had already bound the default to 1825. The sweep now passes half_life_days=hl explicitly. Re-run pending.
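The failure mode can be reproduced in a few lines. The names mirror those in the description above, but fdc here is a stand-in namespace and fit() is a stub that simply returns the half-life it would use.

```python
from types import SimpleNamespace
from typing import List, Sequence

# Stand-in for the real fdc module (hypothetical reconstruction).
fdc = SimpleNamespace(HALF_LIFE_DAYS=1825.0)


def _fit(half_life_days: float = fdc.HALF_LIFE_DAYS) -> float:
    """Stub for fdc.fit(): the default is evaluated once, when def runs."""
    return half_life_days


fdc.fit = _fit


def sweep_buggy(values: Sequence[float]) -> List[float]:
    """Rebinds the module attribute; fit() still sees the old default."""
    used = []
    for hl in values:
        fdc.HALF_LIFE_DAYS = hl
        used.append(fdc.fit())
    fdc.HALF_LIFE_DAYS = 1825.0  # restore the original value
    return used


def sweep_fixed(values: Sequence[float]) -> List[float]:
    """Passes the half-life explicitly on every call."""
    return [fdc.fit(half_life_days=hl) for hl in values]


print(sweep_buggy([365.0, 730.0]))  # [1825.0, 1825.0]
print(sweep_fixed([365.0, 730.0]))  # [365.0, 730.0]
```

An alternative that keeps the module-level knob working is to declare the default as None and read fdc.HALF_LIFE_DAYS inside the function body, so the value is looked up at call time.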
Combined impact
Disabling extremization (d=1.0) and GK offset (alpha=0.0) together yields ensemble Brier 0.49866, down from 0.50134 (production d=1.15, alpha=0.05). Net change: -26.8bp (lower is better).
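The deltas quoted throughout can be rechecked from the table values, assuming 1bp means 0.0001 of Brier score (the convention that reproduces every figure in this report). The helper name is illustrative.

```python
def delta_bp(before: float, after: float) -> float:
    """Change in Brier score in basis points (1bp = 1e-4); negative is better."""
    return round((after - before) * 1e4, 1)


combined = delta_bp(0.50134, 0.49866)     # production -> d=1.0, alpha=0.0
calibrator = delta_bp(0.50134, 0.50512)   # Sweep B: no calibrator -> tier_platt
gk_offset = delta_bp(0.49866, 0.49874)    # Sweep C: alpha=0.0 -> alpha=0.05
ordered_logit = delta_bp(0.49866, 0.50027)  # full ensemble with ordered logit
```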