Research note

Post-processing parameter sweep

title: "Post-processing parameter sweep: extremization, GK offset, calibrator"
status: shipped
date: 2026-06-10
slug: postprocess-tuning-sweep

Walk-forward backtest (8 folds x 90 days, n=1,844 matches) over four post-processing dimensions applied to the uniform Elo/DC/HP ensemble. Brier deltas are quoted in basis points (1bp = 0.0001); lower Brier and lower ECE are better.
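To make the fold layout concrete, here is a minimal sketch of how 8 consecutive 90-day test windows could be generated. The expanding training window, the function name `walk_forward_folds`, and the start date are assumptions for illustration; the note does not say how the folds were actually built.

```python
from datetime import date, timedelta

def walk_forward_folds(first_test_day, n_folds=8, fold_days=90):
    """Return (test_start, test_end) pairs for consecutive test windows.

    Assumed scheme: fold k trains on every match dated before test_start
    and is scored on matches with test_start <= match_date < test_end.
    """
    folds = []
    for k in range(n_folds):
        test_start = first_test_day + timedelta(days=k * fold_days)
        folds.append((test_start, test_start + timedelta(days=fold_days)))
    return folds

# Hypothetical start date, chosen only for illustration.
folds = walk_forward_folds(date(2024, 1, 1))
```

Each test window begins where the previous one ends, so no match is scored twice and no fold trains on data from its own test window.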

Sweep A: Extremization d

Removing extremization (d=1.0) improved ensemble Brier by 26.8bp relative to the production setting (d=1.15), the single largest gain in the sweep.

| d | Brier | ECE |
|---|---|---|
| 1.00 | 0.49866 | 0.0336 |
| 1.05 | 0.49920 | 0.0399 |
| 1.10 | 0.50011 | 0.0356 |
| 1.15 (prod) | 0.50134 | 0.0390 |
| 1.20 | 0.50285 | 0.0438 |
| 1.25 | 0.50461 | 0.0521 |
| 1.30 | 0.50658 | 0.0581 |

Brier worsens monotonically as d increases from 1.00 to 1.30. ECE also trends upward, though not monotonically (0.0399 at d=1.05 versus 0.0356 at d=1.10). The Ranjan-Gneiting (2010) correction assumes the component forecasters are individually well calibrated, but our Elo component uses a fixed 22% draw rate and is poorly calibrated on its own. Extremizing a poorly calibrated component's contribution amplifies the miscalibration rather than correcting underconfidence.

Decision: ship d=1.0 (disable extremization).
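The note does not state the functional form of the extremization step. A minimal sketch, assuming the common power form (raise each class probability to d and renormalize) and a multiclass Brier score summed over the H/D/A classes, shows why d=1.0 is equivalent to disabling it. The function names and the sample forecasts are hypothetical.

```python
import numpy as np

def extremize(probs, d):
    """Assumed power-form extremization: p_i**d, renormalized per match.
    d > 1 sharpens, d < 1 flattens, d == 1.0 leaves probs unchanged."""
    probs = np.asarray(probs, dtype=float)
    powered = probs ** d
    return powered / powered.sum(axis=-1, keepdims=True)

def multiclass_brier(probs, outcomes):
    """Mean over matches of the squared error summed across classes.
    outcomes holds the index of the realized class for each match."""
    probs = np.asarray(probs, dtype=float)
    onehot = np.eye(probs.shape[1])[np.asarray(outcomes)]
    return float(np.mean(np.sum((probs - onehot) ** 2, axis=1)))

# Hypothetical (H, D, A) forecasts and results, for illustration only.
p = np.array([[0.50, 0.27, 0.23],
              [0.35, 0.30, 0.35]])
y = np.array([0, 1])

same = extremize(p, 1.0)     # identity: extremization disabled
sharp = extremize(p, 1.15)   # the favourite gains probability mass
brier = multiclass_brier(p, y)
```

Under this form, any d above 1 moves mass toward the favourite, which only helps if the pooled forecast is underconfident to begin with.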

Sweep B: Calibrator on/off

| Config | Brier | ECE |
|---|---|---|
| No calibrator | 0.50134 | 0.0390 |
| With tier_platt | 0.50512 | 0.0462 |

The production calibrator (tier_platt, Platt temperature) worsens Brier by 37.8bp in walk-forward and raises ECE from 0.0390 to 0.0462. The calibrator was fitted on a 24-month window that includes some of the walk's training data, so this may partly reflect in-sample overfit. Either way, the production default of running without a calibrator is correct.

Decision: no change (calibrator already off by default).

Sweep C: GK offset alpha

| alpha | Brier | ECE |
|---|---|---|
| 0.000 | 0.49866 | 0.0336 |
| 0.025 | 0.49870 | 0.0332 |
| 0.050 (prod) | 0.49874 | 0.0336 |
| 0.075 | 0.49879 | 0.0329 |
| 0.100 | 0.49884 | 0.0333 |

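The production GK offset (alpha=0.05) worsens Brier by 0.8bp relative to alpha=0.0, and ECE moves by at most 0.0007 across the sweep with no clear trend. The effect is too small to distinguish real signal from noise, and removing the offset simplifies the pipeline.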

Decision: ship alpha=0.0 (disable GK offset).

Sweep D: Per-class extremization (d_draw separate)

With d_H/A fixed at 1.15, d_draw was varied from 0.90 to 1.20. Every value scored worse than Sweep A's d=1.00 baseline, so the question is moot given the Sweep A result.

Decision: no ship (moot).

Elo conversion: ordered logit

Walk-forward evaluation of replacing the fixed 22% draw formula with an ordered logistic regression for Elo-to-P(H/D/A) conversion.

| Scope | Current Brier | Ordered-logit Brier | Delta |
|---|---|---|---|
| Elo component only | 0.5332 | 0.5230 | -102bp |
| Full ensemble | 0.49866 | 0.50027 | +16.1bp |

The ordered logit improves the Elo component by 102bp (better draw calibration), but when plugged into the ensemble it worsens Brier by 16.1bp while improving ECE by 3.8pp. This is an ensemble diversity effect: the better-calibrated Elo component becomes more correlated with DC/HP, reducing the information gain from averaging. Paradoxically, the original, poorly calibrated Elo component contributes more useful signal to the ensemble because its errors are less correlated with those of the Poisson models.
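The diversity argument can be illustrated with a standard variance identity for an equal-weight average. This is an idealized sketch, not the Brier decomposition used in the backtest: it assumes all components share the same error variance and a common pairwise error correlation, and the function name is hypothetical.

```python
def averaged_error_variance(s2, rho, m=3):
    """Error variance of an equal-weight average of m components, each with
    error variance s2 and common pairwise error correlation rho."""
    return s2 * (1.0 + (m - 1) * rho) / m

low_corr = averaged_error_variance(1.0, 0.5)    # less correlated errors
high_corr = averaged_error_variance(1.0, 0.9)   # more correlated errors
no_gain = averaged_error_variance(1.0, 1.0)     # identical errors: no benefit
```

As the correlation approaches 1, averaging stops reducing error, so a component that improves individually can still make the ensemble worse if it becomes too similar to the others.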

Stable parameters across 8 walks: theta_1 ~ -0.685, theta_2 ~ +0.625, beta ~ 0.00513. Generalized variant (5 params) adds only 5bp on the component for 2 extra params.
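For reference, an ordered logit with these parameters could be evaluated as sketched below. The parameterization is an assumption: outcomes ordered away < draw < home, cumulative probabilities of the form sigmoid(theta_k - beta * elo_diff), and elo_diff defined as home minus away. The note does not confirm the sign convention or whether home advantage is already folded into the Elo difference.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def ordered_logit_probs(elo_diff, theta_1=-0.685, theta_2=0.625, beta=0.00513):
    """Return (P_home, P_draw, P_away) from a home-minus-away Elo difference.

    Assumed form: P(outcome <= k) = sigmoid(theta_k - beta * elo_diff),
    with outcomes ordered away < draw < home.
    """
    p_away = sigmoid(theta_1 - beta * elo_diff)
    p_away_or_draw = sigmoid(theta_2 - beta * elo_diff)
    return (1.0 - p_away_or_draw, p_away_or_draw - p_away, p_away)

even = ordered_logit_probs(0.0)         # evenly matched teams
favourite = ordered_logit_probs(200.0)  # home side rated 200 points higher
```

Unlike a fixed draw rate, the draw probability here varies with the rating gap, which is where the component-level calibration gain would come from.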

Decision: no ship (hurts ensemble).

Neutral-site / WC-specific adjustments

Analysis of 9,200+ international matches with Elo residuals at neutral venues:

  • Home-continent advantage: +5.0pp win probability (n=124, significant)
  • Host nation bonus: +8.6pp (n=36, small sample, confounded)
  • Mexico City altitude: +9.2pp (confounded with home advantage)
  • Travel proximity: +4.6pp (corroborates continent effect)

CONCACAF teams playing in North American WC venues should see a measurable boost. However, implementing this as a pre-match adjustment requires careful gate testing (the effect is real in aggregate but noisy per-match), so this is parked for a follow-up experiment.

Decision: no ship yet (needs gate test as a feature).

DC half-life sweep

The first run produced a null result because of a bug. Python evaluates default parameter values once, at function definition time, so assigning fdc.HALF_LIFE_DAYS = hl did not propagate to fdc.fit(), whose signature half_life_days: float = HALF_LIFE_DAYS had already bound the default to 1825. The sweep now passes half_life_days=hl explicitly. Re-run pending.
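The binding behaviour behind this bug can be reproduced in a few lines. The functions below are hypothetical stand-ins for fdc.fit(), not the actual module code; the sketch also shows a late-binding alternative using a None sentinel.

```python
HALF_LIFE_DAYS = 1825

def fit_buggy(half_life_days: float = HALF_LIFE_DAYS):
    # The default was evaluated once, when `def` ran, so it stays 1825.
    return half_life_days

def fit_late_bound(half_life_days=None):
    # Look the module-level value up at call time instead.
    if half_life_days is None:
        half_life_days = HALF_LIFE_DAYS
    return half_life_days

HALF_LIFE_DAYS = 365   # rebinding the global after the functions are defined

stale = fit_buggy()                        # still the definition-time default
fresh = fit_late_bound()                   # picks up the rebound global
explicit = fit_buggy(half_life_days=365)   # explicit argument, as in the fix
```

Passing the value explicitly is the least surprising option for a sweep, because the parameter under test is visible at every call site.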

Combined impact

Disabling extremization (d=1.0) and the GK offset (alpha=0.0) together yields ensemble Brier 0.49866, down from 0.50134 in production (d=1.15, alpha=0.05). Net change: -26.8bp (lower Brier is better).
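The basis-point figures quoted throughout follow directly from the table values. A short check, with `delta_bp` as a hypothetical helper name:

```python
def delta_bp(new, old):
    """Brier difference in basis points (1bp = 0.0001); negative is better."""
    return round((new - old) * 1e4, 1)

extremization = delta_bp(0.49866, 0.50134)   # d=1.0 vs d=1.15     → -26.8
calibrator = delta_bp(0.50512, 0.50134)      # tier_platt vs none  → 37.8
gk_offset = delta_bp(0.49874, 0.49866)       # alpha=0.05 vs 0.0   → 0.8
ordered_logit = delta_bp(0.50027, 0.49866)   # ensemble with logit → 16.1
```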