Can we fit the player-strength coefficient instead of hand-setting it? (No)

Status: Not shipped. Negative result (2026-05-29). The hand-set α = 0.05 offset (Model 16) beats a per-fold fitted α on median Brier. The shipped offset stays as-is.
Backtest date: 2026-05-29
Reproducer: scripts/backtest_composite_fitted_alpha.py --today 2026-05-29
Code: scripts/backtest_composite_fitted_alpha.py (new)
Tests: tests/test_composite_fitted_alpha.py
Sidecar output: data/wc2026/composite_fitted_alpha_backtest.json (gitignored)

Hypothesis

The player-composite differential is the lineup-aware team-strength signal that the improvements roadmap §2.3 flags as the highest-leverage item in the literature. Today it enters the model only as a hand-set offset on the log-rate scale, i.e. a multiplicative adjustment to the scoring rates ("Model 16"): for a fixture, Δ = α · (composite_home − composite_away) is added to home log-λ (and subtracted from away log-λ), with the single coefficient α = 0.05 chosen from the coarse 5-point grid {0, 0.005, 0.01, 0.02, 0.05} in scripts/backtest_composite_offset.py.
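A minimal sketch of the offset described above, not the project's predict_match code. The function name, argument names and the example rates and composite values are hypothetical; only the formula Δ = α · (composite_home − composite_away) and α = 0.05 come from this note.

```python
import math

SHIPPED_ALPHA = 0.05  # hand-set coefficient ("Model 16")


def apply_composite_offset(lam_home, lam_away, composite_home, composite_away,
                           alpha=SHIPPED_ALPHA):
    """Return offset-adjusted (lam_home, lam_away).

    delta is added to the home log-rate and subtracted from the away
    log-rate, so the rates are scaled by exp(+delta) and exp(-delta).
    """
    delta = alpha * (composite_home - composite_away)
    new_home = math.exp(math.log(lam_home) + delta)
    new_away = math.exp(math.log(lam_away) - delta)
    return new_home, new_away


# Hypothetical inputs: base rates 1.40 / 1.10 and a composite gap of 2.0,
# giving delta = 0.05 * 2.0 = 0.10.
adj_home, adj_away = apply_composite_offset(1.40, 1.10, 7.0, 5.0)
print(f"home {adj_home:.3f}, away {adj_away:.3f}")
```

Because the shift is symmetric, the product of the two rates is unchanged, and α = 0 reproduces the offset-off baseline.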

Two weaknesses motivated re-fitting α as a first-class feature:

The grid is right-censored. 0.05 is the largest candidate and the grid's Brier improvement is monotone increasing in α, so the true optimum could lie above the grid's upper edge (> 0.05). A continuous fit reveals whether it does.
One global α, mild look-ahead. The grid picks a single α by aggregating across all eight walks, so each walk's chosen α "sees" the others. A strictly in-fold fit (validation slice pre-cutoff only, evaluate on the post-cutoff holdout) is the honest test of whether the composite differential generalises as a fitted feature.

Design — why a single continuous α (lowest capacity)

The prior art on this codebase is unambiguous about capacity at this sample size:

The ~20-covariate gradient-boosting meta-learner (fit_ensemble_meta.py) overfit and lost badly (0.533 vs 0.503).
Even the 3-parameter Bayesian stacker (fit_stacking_weights.py) only matched uniform weighting, losing by ~1 bp on median Brier.

So this experiment fits the single lowest-capacity parametric form available: one continuous scalar α, refit per fold by 1-D Brier minimisation (scipy.optimize.minimize_scalar, bounded [0, 0.30]). This is strictly lower capacity than the rejected alternatives, and lower than the two other variants considered:

Tier-conditional α (separate coefficient per tournament tier) was rejected up front: the composite offset only fires on ≈229 holdout matches across 8 walks (see "The binding constraint" below), so splitting by tier would leave a handful of matches per tier per fold — guaranteed overfit.
A logistic/ordinal-logit blend term adds a second mixing weight and another calibration stage for no structural gain over the offset that is already wired into predict_match.

Method

Per walk (8 folds × 90-day eval windows, identical to the existing backtest harness):

Refit DC + HP on data strictly before the walk's cutoff (backtest_models.fit_models_pre_cutoff).
Fit α in-fold. Carve the 180-day window immediately before the cutoff as a validation slice — still pre-cutoff, no look-ahead into the eval window. Minimise the DC+HP uniform-mean 3-class Brier on that slice's composite-covered matches over α ∈ [0, 0.30]. If the slice has < 8 composite-covered matches, fall back to α = 0.05 (so a thin fold can't emit a wild α; in this run no fold fell back).
Evaluate three coefficient policies on the post-cutoff holdout, all on the same DC/HP fit and the same eval rows: α = 0 (offset off), α = 0.05 (the shipped incumbent), α = α_fit.
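The in-fold fit in the steps above can be sketched as follows. This is an illustration under stated assumptions, not the code in scripts/backtest_composite_fitted_alpha.py: it scores a single independent-Poisson rate pair per match instead of the DC+HP uniform mean, it uses a standard-library golden-section search as a stand-in for scipy.optimize.minimize_scalar with bounds, and the match tuple layout and all function names are hypothetical.

```python
import math

ALPHA_FALLBACK = 0.05        # shipped incumbent, used when the slice is too thin
MIN_COVERED = 8              # minimum composite-covered matches, from the method
ALPHA_BOUNDS = (0.0, 0.30)   # search interval, from the method


def outcome_probs(lam_home, lam_away, max_goals=10):
    """P(home win), P(draw), P(away win) under independent Poisson scores."""
    def pmf(k, lam):
        return math.exp(-lam) * lam ** k / math.factorial(k)
    n = max_goals + 1
    ph = [pmf(k, lam_home) for k in range(n)]
    pa = [pmf(k, lam_away) for k in range(n)]
    home = sum(ph[i] * pa[j] for i in range(n) for j in range(i))
    draw = sum(ph[i] * pa[i] for i in range(n))
    away = sum(ph[i] * pa[j] for i in range(n) for j in range(i + 1, n))
    total = home + draw + away  # renormalise after truncating at max_goals
    return home / total, draw / total, away / total


def brier(alpha, matches):
    """Mean 3-class Brier; each match is (lam_home, lam_away, comp_diff, outcome)."""
    total = 0.0
    for lam_h, lam_a, comp_diff, outcome in matches:
        delta = alpha * comp_diff
        probs = outcome_probs(lam_h * math.exp(delta), lam_a * math.exp(-delta))
        total += sum((p - (1.0 if k == outcome else 0.0)) ** 2
                     for k, p in enumerate(probs))
    return total / len(matches)


def golden_section_min(f, lo, hi, tol=1e-5):
    """Bounded 1-D minimiser (stand-in for scipy's bounded minimize_scalar)."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    c, d = b - inv_phi * (b - a), a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc < fd:
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
    return (a + b) / 2


def fit_alpha_in_fold(validation_matches):
    """Fit alpha on the pre-cutoff validation slice, with the thin-slice fallback."""
    covered = [m for m in validation_matches if m[2] is not None]
    if len(covered) < MIN_COVERED:
        return ALPHA_FALLBACK
    return golden_section_min(lambda a: brier(a, covered), *ALPHA_BOUNDS)
```

Matches whose composite differential is None (at least one team without a composite) are dropped before fitting, mirroring the "composite-covered" restriction; they would contribute the same Brier for every α anyway.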

Metrics are computed on the DC+HP uniform mean, with Elo excluded. DC+HP is the offset's natural target and matches backtest_composite_offset.py, so the small offset effect isn't diluted by an unrelated component.
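For concreteness, a small sketch of what "DC+HP uniform mean" scoring could look like for one fixture. The function names and the example probabilities are hypothetical, and the Brier convention (sum of squared errors over the three classes) is an assumption inferred from the magnitude of the scores reported in this note.

```python
import math


def uniform_mean(probs_dc, probs_hp):
    """Equal-weight average of two (home, draw, away) probability triples."""
    return tuple((p + q) / 2 for p, q in zip(probs_dc, probs_hp))


def brier_3class(probs, outcome):
    """Sum of squared errors over the three classes; outcome is 0, 1 or 2."""
    return sum((p - (1.0 if k == outcome else 0.0)) ** 2
               for k, p in enumerate(probs))


def log_loss(probs, outcome, eps=1e-12):
    """Negative log of the probability assigned to the realised outcome."""
    return -math.log(max(probs[outcome], eps))


# Hypothetical model outputs for one fixture that ended in a home win (class 0).
blend = uniform_mean((0.50, 0.30, 0.20), (0.40, 0.30, 0.30))
print(blend, brier_3class(blend, 0), log_loss(blend, 0))
```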

Harness fidelity check. The α = 0 and α = 0.05 baselines computed by this script match backtest_composite_offset.py bit-for-bit (0.514023 and 0.511581 median Brier respectively), confirming no methodology drift — the three-way comparison is apples-to-apples.

The binding constraint: the offset fires on only ~12% of matches

The composite covers only the 48 WC2026 qualifiers. The offset fires only when both teams carry a composite. Across the 8-walk holdout that is 229 of 1,970 matches (11.6%) — and one walk has only 2 affected matches:

Walk eval window	n_holdout	both-WC (offset fires)
2024-06-08 → 2024-09-06	265	44
2024-09-06 → 2024-12-05	485	42
2024-12-05 → 2025-03-05	66	2
2025-03-05 → 2025-06-03	202	16
2025-06-03 → 2025-09-01	214	31
2025-09-01 → 2025-11-30	502	49
2025-11-30 → 2026-02-28	89	14
2026-02-28 → 2026-05-29	147	31

On the other ~88% of matches the offset is a literal no-op, so every α produces identical predictions. The validation slice that drives the in-fold fit is similarly thin (18–86 covered matches per fold). At that effective n the in-fold Brier surface in α is nearly flat and dominated by noise.
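The walk windows and the coverage figure can be re-derived from the table with a few lines of arithmetic. The counts are copied from the table above; treating the walks as back-to-back 90-day windows ending at the --today date is an assumption that happens to reproduce the listed dates, not a statement about how the harness builds them.

```python
from datetime import date, timedelta

TODAY = date(2026, 5, 29)
EVAL_DAYS, VAL_DAYS, N_WALKS = 90, 180, 8

# (n_holdout, n_both_wc) per walk, oldest first, copied from the table above.
COUNTS = [(265, 44), (485, 42), (66, 2), (202, 16),
          (214, 31), (502, 49), (89, 14), (147, 31)]

walks = []
for i, (n_holdout, n_fires) in enumerate(COUNTS):
    eval_end = TODAY - timedelta(days=EVAL_DAYS * (N_WALKS - 1 - i))
    cutoff = eval_end - timedelta(days=EVAL_DAYS)
    walks.append({
        "val_start": cutoff - timedelta(days=VAL_DAYS),  # validation slice start
        "cutoff": cutoff,                                # models fit strictly before this
        "eval_end": eval_end,
        "n_holdout": n_holdout,
        "n_fires": n_fires,
    })

total_holdout = sum(w["n_holdout"] for w in walks)
total_fires = sum(w["n_fires"] for w in walks)
coverage = total_fires / total_holdout

for w in walks:
    share = w["n_fires"] / w["n_holdout"]
    print(f'{w["cutoff"]} -> {w["eval_end"]}: '
          f'{w["n_fires"]}/{w["n_holdout"]} ({share:.1%})')
print(f"overall: {total_fires}/{total_holdout} ({coverage:.1%})")
```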

Result — gate FAILED

Three-way comparison on the DC+HP uniform mean, aggregated across the 8 walks (median for Brier, mean for log-loss and ECE):

Policy	median Brier	mean log-loss	mean ECE
α = 0 (offset off)	0.514023	0.860648	8.24 pp
α = 0.05 (incumbent)	0.511581	0.860347	8.60 pp
α = fitted (per-fold)	0.513901	0.860791	8.11 pp

Gate (fitted vs the INCUMBENT α = 0.05): median Brier 0.513901 vs 0.511581. The fitted variant is 23.2 bp WORSE (1 bp = 0.0001 Brier), so the strictly-lower-Brier half of the conjunction fails outright. (The fitted variant's median ECE is better, 6.93 pp vs 7.54 pp, in line with the mean ECE in the table, but that's moot once Brier fails.) GATE FAILED.
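A hypothetical sketch of a conjunction gate applied to the figures above. This is not metrics.apply_conjunction_gate, whose signature is not shown in this note. The Brier half (strictly lower) is as stated in the text; the ECE half ("not worse") is an assumption made for illustration.

```python
def conjunction_gate(candidate, incumbent):
    """Pass only if Brier is strictly lower AND ECE is not worse (assumed rule)."""
    brier_ok = candidate["median_brier"] < incumbent["median_brier"]
    ece_ok = candidate["median_ece_pp"] <= incumbent["median_ece_pp"]
    return brier_ok and ece_ok, {"brier_ok": brier_ok, "ece_ok": ece_ok}


# Medians quoted in the gate paragraph above.
fitted = {"median_brier": 0.513901, "median_ece_pp": 6.93}
incumbent = {"median_brier": 0.511581, "median_ece_pp": 7.54}

passed, parts = conjunction_gate(fitted, incumbent)
gap_bp = (fitted["median_brier"] - incumbent["median_brier"]) * 1e4
print(passed, parts, f"{gap_bp:.1f} bp")
```

Under this rule the ECE half passes and the Brier half fails, so the conjunction fails.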

Per-fold fitted α and the eval-window Brier each policy produced:

Walk	n_val_cov	α_fit	Brier α=0	Brier α=0.05	Brier α_fit
2024-06→09	49	0.0001	0.526980	0.526118	0.526978
2024-09→12	67	0.0270	0.517776	0.517576	0.517537
2024-12→03	86	0.0405	0.566370	0.566305	0.566288
2025-03→06	44	0.0291	0.478713	0.481509	0.480284
2025-06→09	18	0.0001	0.455950	0.457612	0.455951
2025-09→11	47	0.0001	0.468151	0.469494	0.468152
2025-11→02	80	0.0001	0.510271	0.505586	0.510265
2026-02→05	63	0.0028	0.525971	0.524360	0.525849

Median fitted α = 0.0014 (vs shipped 0.05).
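The headline medians can be recomputed from the per-fold rows as a sanity check. The values are copied from the table above at the precision shown, so the results agree with the reported figures only up to rounding in the last digit.

```python
from statistics import median

# (alpha_fit, Brier at alpha=0, Brier at alpha=0.05, Brier at alpha_fit),
# oldest walk first, copied from the per-fold table above.
FOLDS = [
    (0.0001, 0.526980, 0.526118, 0.526978),
    (0.0270, 0.517776, 0.517576, 0.517537),
    (0.0405, 0.566370, 0.566305, 0.566288),
    (0.0291, 0.478713, 0.481509, 0.480284),
    (0.0001, 0.455950, 0.457612, 0.455951),
    (0.0001, 0.468151, 0.469494, 0.468152),
    (0.0001, 0.510271, 0.505586, 0.510265),
    (0.0028, 0.525971, 0.524360, 0.525849),
]

median_alpha = median(f[0] for f in FOLDS)
med_off, med_inc, med_fit = (median(f[k] for f in FOLDS) for k in (1, 2, 3))
max_alpha = max(f[0] for f in FOLDS)

print(f"median alpha_fit = {median_alpha:.5f}, max alpha_fit = {max_alpha:.4f}")
print(f"median Brier: off {med_off:.6f}, incumbent {med_inc:.6f}, fitted {med_fit:.6f}")
```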

Why the hand-set prior beats the honest fit

The fitted α collapses toward zero on 5 of 8 folds (four at α ≈ 0.0001, one at 0.0028) and only moves meaningfully on 3 folds (0.0270, 0.0405, 0.0291). This is the classic sparse-signal failure mode:

The validation and eval populations differ. The in-fold optimiser minimises Brier on the recent 180-day slice; the eval window is the next 90 days, a partly different set of teams/fixtures. With only dozens of covered matches on each side, the α that's best on the validation slice is a noisy estimate of the α that's best on the eval window.
The Brier-vs-α surface is nearly flat. Because the offset moves only ~12% of matches by a small log-rate amount, the in-fold objective barely changes with α — so the optimiser frequently parks at the lower bound (α ≈ 0).
The fixed 0.05 is a stronger, pooled prior. The grid that chose 0.05 effectively pooled the signal across all eight walks. That pooling is exactly the regularisation the per-fold fit gives up. At this sample size, the pooled prior generalises better than fitting freely in each fold — even though pooling is the mild look-ahead we set out to remove.

The "right-censored grid" hypothesis is not borne out by the per-fold fit: no fold wants α > 0.05. The three folds that move off zero settle at 0.0270, 0.0291 and 0.0405, all below the shipped value and far inside the [0, 0.30] search bound. The signal is also far too noisy fold-to-fold for a continuous per-fold fit to capitalise on. The direction is real (every non-zero α improves the pooled grid Brier); the per-fold fittability is not there.

Decision

No ship. Keep the hand-set α = 0.05 offset (Model 16). A free per-fold fit underperforms it because the composite signal is too sparse (≈12% of matches, dozens per fold) to fit honestly in-fold; the pooled grid's 0.05 is the better-regularised choice at this n.

The real lever is coverage, not coefficient form: the offset will only ever touch WC-vs-WC fixtures until the player composite is extended beyond the 48 qualifiers. Re-running this experiment is worthwhile only after the composite's match coverage materially increases (e.g. club competitions, or a broader international player pool) — at which point a fitted or even tier-conditional α may become honestly estimable. Until then the grid-tuned scalar is the right tool.

Caveats

Negative result, but a clean one. The harness reproduces the incumbent baselines exactly, the fit is strictly in-fold (no look-ahead), and the gate is the project-standard conjunction routed through metrics.apply_conjunction_gate. The conclusion is not an artefact of the experiment design.
No production artefact was mutated. This experiment only writes a gitignored sidecar; dixon_coles.json / hierarchical_poisson.json retain their shipped composite_alpha = 0.05.
Metrics are on the DC+HP mean, not the full Elo+DC+HP ensemble. Folded into the full ensemble (Elo diluting the offset), the effect is smaller still — consistent with the calibration-plateau diagnosis in ds-models-plan.md.
