Research note

Can we fit the player-strength coefficient instead of hand-setting it? (No)



Status: Not shipped. Negative result (2026-05-29). The hand-set α = 0.05 offset (Model 16) beats a per-fold fitted α on median Brier. The shipped offset stays as-is.

Backtest date: 2026-05-29
Reproducer: scripts/backtest_composite_fitted_alpha.py --today 2026-05-29
Code: scripts/backtest_composite_fitted_alpha.py (new)
Tests: tests/test_composite_fitted_alpha.py
Sidecar output: data/wc2026/composite_fitted_alpha_backtest.json (gitignored)

Hypothesis

The player-composite differential is the lineup-aware team-strength signal the improvements roadmap §2.3 flags as the highest-leverage item in the literature. Today it enters the model only as a hand-set multiplicative log-rate offset ("Model 16"): for a fixture, Δ = α · (composite_home − composite_away) is added to home log-λ (and subtracted from away log-λ), with the single coefficient α = 0.05 chosen by the coarse 5-point grid {0, 0.005, 0.01, 0.02, 0.05} in scripts/backtest_composite_offset.py.
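The offset rule can be sketched in a few lines. This is a minimal illustration, not the code in predict_match: the function name, the use of None for a missing composite, and the fixture values are all assumptions made for the example. Only the formula Δ = α · (composite_home − composite_away) and the rule that the offset applies only when both teams carry a composite come from this note.

```python
import math

def apply_composite_offset(log_lam_home, log_lam_away, comp_home, comp_away, alpha=0.05):
    """Shift both log-rates by the composite offset (hypothetical helper).

    delta = alpha * (comp_home - comp_away) is added to the home log-rate
    and subtracted from the away log-rate. If either composite is missing
    (None), the offset is a no-op and the inputs are returned unchanged.
    """
    if comp_home is None or comp_away is None:
        return log_lam_home, log_lam_away
    delta = alpha * (comp_home - comp_away)
    return log_lam_home + delta, log_lam_away - delta

# Hypothetical fixture: base rates 1.4 and 1.1, composite gap of 2.0 (made-up scale).
log_h, log_a = math.log(1.4), math.log(1.1)
new_h, new_a = apply_composite_offset(log_h, log_a, comp_home=3.0, comp_away=1.0)

# On the rate scale the additive log offset is a multiplicative factor:
# home is scaled by exp(+delta), away by exp(-delta), with delta = 0.05 * 2.0.
print(math.exp(new_h) / 1.4, math.exp(new_a) / 1.1)
```

Because the shift is additive in log-λ, it multiplies the goal rates by exp(±Δ), which is why the note calls it a multiplicative log-rate offset.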

Two weaknesses motivated re-fitting α as a first-class feature:

  1. The grid is right-censored. 0.05 is the largest candidate and the grid's Brier improvement is monotone increasing in α, so the true optimum could lie above the top of the grid (α > 0.05). A continuous fit reveals whether it does.
  2. One global α, mild look-ahead. The grid picks a single α by aggregating across all eight walks, so each walk's chosen α "sees" the others. A strictly in-fold fit (validation slice pre-cutoff only, evaluate on the post-cutoff holdout) is the honest test of whether the composite differential generalises as a fitted feature.

Design — why a single continuous α (lowest capacity)

The prior art on this codebase is unambiguous about capacity at this sample size:

  • The ~20-covariate gradient-boosting meta-learner (fit_ensemble_meta.py) overfit and lost badly (0.533 vs 0.503).
  • Even the 3-parameter Bayesian stacker (fit_stacking_weights.py) did no better than the uniform mean, losing by ~1 bp on median Brier.

So this experiment fits the single lowest-capacity parametric form available: one continuous scalar α, refit per fold by 1-D Brier minimisation (scipy.optimize.minimize_scalar, bounded [0, 0.30]). This is strictly lower capacity than the rejected alternatives, and lower than the two other variants considered:

  • Tier-conditional α (separate coefficient per tournament tier) was rejected up front: the composite offset only fires on ≈229 holdout matches across 8 walks (see "The binding constraint" below), so splitting by tier would leave a handful of matches per tier per fold — guaranteed overfit.
  • A logistic/ordinal-logit blend term adds a second mixing weight and another calibration stage for no structural gain over the offset that is already wired into predict_match.

Method

Per walk (8 folds × 90-day eval windows, identical to the existing backtest harness):

  1. Refit DC + HP on data strictly before the walk's cutoff (backtest_models.fit_models_pre_cutoff).
  2. Fit α in-fold. Carve the 180-day window immediately before the cutoff as a validation slice — still pre-cutoff, no look-ahead into the eval window. Minimise the DC+HP uniform-mean 3-class Brier on that slice's composite-covered matches over α ∈ [0, 0.30]. If the slice has < 8 composite-covered matches, fall back to α = 0.05 (so a thin fold can't emit a wild α; in this run no fold fell back).
  3. Evaluate three coefficient policies on the post-cutoff holdout, all on the same DC/HP fit and the same eval rows: α = 0 (offset off), α = 0.05 (the shipped incumbent), α = α_fit.
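Step 2 can be sketched as follows. This is a simplified stand-in, not the project script. It assumes a single independent-Poisson model instead of the DC+HP uniform mean, defines the 3-class Brier as the sum of squared errors over the three outcomes, and swaps scipy.optimize.minimize_scalar for a hand-rolled golden-section search so the example needs only the standard library. All function names and the match tuple layout are hypothetical.

```python
import math

ALPHA_BOUNDS = (0.0, 0.30)  # search interval from the note
FALLBACK_ALPHA = 0.05       # shipped value, used when the slice is too thin
MIN_COVERED = 8             # minimum composite-covered validation matches

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)

def outcome_probs(lam_home, lam_away, max_goals=10):
    """(P(home win), P(draw), P(away win)) under independent Poisson goals."""
    n = max_goals + 1
    ph = [poisson_pmf(k, lam_home) for k in range(n)]
    pa = [poisson_pmf(k, lam_away) for k in range(n)]
    home = sum(ph[i] * pa[j] for i in range(n) for j in range(i))
    draw = sum(ph[i] * pa[i] for i in range(n))
    away = sum(ph[i] * pa[j] for i in range(n) for j in range(i + 1, n))
    total = home + draw + away  # renormalise the truncated score grid
    return home / total, draw / total, away / total

def brier_at_alpha(alpha, matches):
    """Mean 3-class Brier over (lam_home, lam_away, comp_diff, outcome) tuples.

    outcome is 0 (home win), 1 (draw) or 2 (away win).
    """
    total = 0.0
    for lam_h, lam_a, diff, outcome in matches:
        delta = alpha * diff
        probs = outcome_probs(lam_h * math.exp(delta), lam_a * math.exp(-delta))
        total += sum((p - (1.0 if k == outcome else 0.0)) ** 2
                     for k, p in enumerate(probs))
    return total / len(matches)

def golden_section_min(f, lo, hi, tol=1e-5):
    """Bounded 1-D minimiser (stand-in for scipy's bounded minimize_scalar)."""
    inv_phi = (math.sqrt(5) - 1) / 2
    c = hi - inv_phi * (hi - lo)
    d = lo + inv_phi * (hi - lo)
    fc, fd = f(c), f(d)
    while hi - lo > tol:
        if fc < fd:  # minimum lies in [lo, d]
            hi, d, fd = d, c, fc
            c = hi - inv_phi * (hi - lo)
            fc = f(c)
        else:        # minimum lies in [c, hi]
            lo, c, fc = c, d, fd
            d = lo + inv_phi * (hi - lo)
            fd = f(d)
    return (lo + hi) / 2

def fit_alpha_in_fold(covered_matches):
    """Fit alpha on the pre-cutoff validation slice, with the thin-slice fallback."""
    if len(covered_matches) < MIN_COVERED:
        return FALLBACK_ALPHA
    return golden_section_min(lambda a: brier_at_alpha(a, covered_matches), *ALPHA_BOUNDS)
```

The fallback branch mirrors the rule in step 2: a slice with fewer than 8 covered matches returns the shipped 0.05 rather than an optimiser output.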

Metrics are computed on the DC+HP uniform mean, with Elo excluded. DC+HP is the offset's natural target and matches backtest_composite_offset.py, so the small offset effect isn't diluted by an unrelated component.

Harness fidelity check. The α = 0 and α = 0.05 baselines computed by this script match backtest_composite_offset.py bit-for-bit (0.514023 and 0.511581 median Brier respectively), confirming no methodology drift — the three-way comparison is apples-to-apples.

The binding constraint: the offset fires on only ~12% of matches

The composite covers only the 48 WC2026 qualifiers. The offset fires only when both teams carry a composite. Across the 8-walk holdout that is 229 of 1,970 matches (11.6%) — and one walk has only 2 affected matches:

Walk eval window | n_holdout | both-WC (offset fires)
--- | --- | ---
2024-06-08 → 2024-09-06 | 265 | 44
2024-09-06 → 2024-12-05 | 485 | 42
2024-12-05 → 2025-03-05 | 66 | 2
2025-03-05 → 2025-06-03 | 202 | 16
2025-06-03 → 2025-09-01 | 214 | 31
2025-09-01 → 2025-11-30 | 502 | 49
2025-11-30 → 2026-02-28 | 89 | 14
2026-02-28 → 2026-05-29 | 147 | 31
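The coverage figures can be checked directly from the table above. The snippet below simply copies the per-walk counts and recomputes the totals and the overall share; the variable names are illustrative.

```python
# (eval window, n_holdout, n_both_wc) copied from the coverage table.
walks = [
    ("2024-06-08 → 2024-09-06", 265, 44),
    ("2024-09-06 → 2024-12-05", 485, 42),
    ("2024-12-05 → 2025-03-05", 66, 2),
    ("2025-03-05 → 2025-06-03", 202, 16),
    ("2025-06-03 → 2025-09-01", 214, 31),
    ("2025-09-01 → 2025-11-30", 502, 49),
    ("2025-11-30 → 2026-02-28", 89, 14),
    ("2026-02-28 → 2026-05-29", 147, 31),
]

n_total = sum(n for _, n, _ in walks)
n_fired = sum(k for _, _, k in walks)
share = f"{n_fired / n_total:.1%}"
print(n_total, n_fired, share)  # 1970 229 11.6%

# Per-walk share of matches where the offset actually changes the prediction.
for window, n, k in walks:
    print(f"{window}: {k}/{n} = {k / n:.1%}")
```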

On the other ~88% of matches the offset is a literal no-op, so every α produces identical predictions. The validation slice that drives the in-fold fit is similarly thin (18–86 covered matches). At that effective n the in-fold Brier surface in α is nearly flat and dominated by noise.

Result — gate FAILED

Three-way comparison, DC+HP uniform mean, median across 8 walks:

Policy | median Brier | mean log-loss | mean ECE
--- | --- | --- | ---
α = 0 (offset off) | 0.514023 | 0.860648 | 8.24 pp
α = 0.05 (incumbent) | 0.511581 | 0.860347 | 8.60 pp
α = fitted (per-fold) | 0.513901 | 0.860791 | 8.11 pp

Gate (fitted vs the INCUMBENT α = 0.05): median Brier 0.513901 vs 0.511581 — the fitted variant is 23.2 bp WORSE, so the strictly-lower-Brier half of the conjunction fails outright. (The fitted variant's ECE is better, 6.93pp vs 7.54pp median, but that's moot once Brier fails.) GATE FAILED.
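The gate arithmetic can be reproduced with a small stand-in. The function below is hypothetical and is not metrics.apply_conjunction_gate, whose signature this note does not show. It assumes the conjunction is "strictly lower median Brier AND median ECE no worse than the incumbent"; only the Brier half is stated explicitly above, so the ECE half is an assumption.

```python
def conjunction_gate(cand_brier, inc_brier, cand_ece, inc_ece):
    """Hypothetical gate: pass only if both halves of the conjunction hold."""
    brier_ok = cand_brier < inc_brier  # strictly lower Brier (stated in the note)
    ece_ok = cand_ece <= inc_ece       # assumed form of the ECE half
    return brier_ok and ece_ok, {"brier_ok": brier_ok, "ece_ok": ece_ok}

# Median Brier and median ECE (pp) for fitted vs incumbent, from the text above.
passed, parts = conjunction_gate(0.513901, 0.511581, 6.93, 7.54)

# Brier gap in basis points (1 bp = 1e-4).
gap_bp = round((0.513901 - 0.511581) * 1e4, 1)
print(passed, parts, gap_bp)  # False {'brier_ok': False, 'ece_ok': True} 23.2
```

Under this reading the better ECE cannot rescue the candidate, because a conjunction fails as soon as one half fails.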

Per-fold fitted α and the eval-window Brier each policy produced:

Walk | n_val_cov | α_fit | Brier α=0 | Brier α=0.05 | Brier α_fit
--- | --- | --- | --- | --- | ---
2024-06→09 | 49 | 0.0001 | 0.526980 | 0.526118 | 0.526978
2024-09→12 | 67 | 0.0270 | 0.517776 | 0.517576 | 0.517537
2024-12→03 | 86 | 0.0405 | 0.566370 | 0.566305 | 0.566288
2025-03→06 | 44 | 0.0291 | 0.478713 | 0.481509 | 0.480284
2025-06→09 | 18 | 0.0001 | 0.455950 | 0.457612 | 0.455951
2025-09→11 | 47 | 0.0001 | 0.468151 | 0.469494 | 0.468152
2025-11→02 | 80 | 0.0001 | 0.510271 | 0.505586 | 0.510265
2026-02→05 | 63 | 0.0028 | 0.525971 | 0.524360 | 0.525849

Median fitted α = 0.0014 (vs shipped 0.05).
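As a consistency check, the headline medians can be recomputed from the per-fold table. The values below are the rounded figures printed above, so the results agree with the reported medians only up to rounding in the last digit.

```python
from statistics import median

# (alpha_fit, Brier at alpha=0, Brier at alpha=0.05, Brier at alpha_fit) per walk,
# copied from the per-fold table.
folds = [
    (0.0001, 0.526980, 0.526118, 0.526978),
    (0.0270, 0.517776, 0.517576, 0.517537),
    (0.0405, 0.566370, 0.566305, 0.566288),
    (0.0291, 0.478713, 0.481509, 0.480284),
    (0.0001, 0.455950, 0.457612, 0.455951),
    (0.0001, 0.468151, 0.469494, 0.468152),
    (0.0001, 0.510271, 0.505586, 0.510265),
    (0.0028, 0.525971, 0.524360, 0.525849),
]

med_alpha = median(f[0] for f in folds)
med_off = median(f[1] for f in folds)
med_incumbent = median(f[2] for f in folds)
med_fitted = median(f[3] for f in folds)

# With 8 folds the median is the mean of the 4th and 5th sorted values.
print(f"median alpha_fit    = {med_alpha:.5f}")
print(f"median Brier a=0    = {med_off:.6f}")
print(f"median Brier a=0.05 = {med_incumbent:.6f}")
print(f"median Brier a_fit  = {med_fitted:.6f}")

# Folds on which each non-zero policy beats the offset-off baseline.
print(sum(f[2] < f[1] for f in folds), sum(f[3] < f[1] for f in folds))
```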

Why the hand-set prior beats the honest fit

The fitted α collapses toward zero on 5 of 8 folds (four sit at α ≈ 0.0001 and one at 0.0028) and only moves meaningfully on 3 folds (0.027, 0.041, 0.029). This is the classic sparse-signal failure mode:

  • The validation and eval populations differ. The in-fold optimiser minimises Brier on the recent 180-day slice; the eval window is the next 90 days, a partly different set of teams/fixtures. With only dozens of covered matches on each side, the α that's best on the validation slice is a noisy estimate of the α that's best on the eval window.
  • The Brier-vs-α surface is nearly flat. Because the offset moves only ~12% of matches by a small log-rate amount, the in-fold objective barely changes with α — so the optimiser frequently parks at the lower bound (α ≈ 0).
  • The fixed 0.05 is a stronger, pooled prior. The grid that chose 0.05 effectively pooled the signal across all eight walks. That pooling is exactly the regularisation the per-fold fit gives up. At this sample size, the pooled prior generalises better than fitting freely in each fold — even though pooling is the mild look-ahead we set out to remove.

The "right-censored grid" hypothesis finds no support in the per-fold fits: although the search was bounded at 0.30, no fold chooses α > 0.05, and the largest fitted values (0.0405, 0.0291) sit below the shipped coefficient. The signal is far too noisy fold-to-fold for a continuous per-fold fit to capitalise on it. The direction is real (every non-zero α improves the pooled grid Brier); the per-fold fittability is not there.

Decision

No ship. Keep the hand-set α = 0.05 offset (Model 16). A free per-fold fit underperforms it because the composite signal is too sparse (≈12% of matches, dozens per fold) to fit honestly in-fold; the pooled grid's 0.05 is the better-regularised choice at this n.

The real lever is coverage, not coefficient form: the offset will only ever touch WC-vs-WC fixtures until the player composite is extended beyond the 48 qualifiers. Re-running this experiment is worthwhile only after the composite's match coverage materially increases (e.g. club competitions, or a broader international player pool) — at which point a fitted or even tier-conditional α may become honestly estimable. Until then the grid-tuned scalar is the right tool.

Caveats

  • Negative result, but a clean one. The harness reproduces the incumbent baselines exactly, the fit is strictly in-fold (no look-ahead), and the gate is the project-standard conjunction routed through metrics.apply_conjunction_gate. The conclusion is not an artefact of the experiment design.
  • No production artefact was mutated. This experiment only writes a gitignored sidecar; dixon_coles.json / hierarchical_poisson.json retain their shipped composite_alpha = 0.05.
  • Metrics are on the DC+HP mean, not the full Elo+DC+HP ensemble. Folded into the full ensemble (Elo diluting the offset), the effect is smaller still — consistent with the calibration-plateau diagnosis in ds-models-plan.md.