Research note

Do some playing styles beat others? (Not enough to measure)

Status: Not shipped. Both acceptance gates failed. See decision gate at the bottom.
Backtest date: 2026-05-24

Reproducers:

  • scripts/build_style_matchup_training.py (per-match training join)
  • scripts/backtest_style_matchup.py (walk-forward gate harness)

Persisted outputs:

  • data/wc2026/style_matchup_training.csv: per-match training pool (regenerable, gitignored)
  • data/wc2026/style_matchup_backtest.json: per-walk metrics + gate verdicts (regenerable, gitignored)

Roadmap entry: documentation/improvements-roadmap.md §2.7 Track B

Hypothesis

The §2.7 Track B model (canonical reference: scripts/fit_style_matchup.py) posits that historical international match results carry a small, per-style-pair residual that DC and HP — which average over style at the team level — do not absorb. Specifically, for each (home_style, away_style) cell in an 8×8 grid of canonical tactical-fingerprint labels, fit a pair of log-rate offsets (δ_home, δ_away) that nudge each side's Poisson rate after the goal-process baseline.

Model:

log λ_h = log λ̃_h + δ_h[s_h, s_a]
log λ_a = log λ̃_a + δ_a[s_h, s_a]

with weakly-informative Gaussian shrinkage prior δ ~ N(0, σ²), σ = 0.05 (the published-research scale for style-based effects in football: ~95% prior mass within ±0.10 of log-rate, ≈ ±10% on the rate).
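
A minimal sketch of how the offsets act on the rates and what the prior scale implies. The function name `apply_style_offsets` and the example rates and offsets are hypothetical and are not taken from scripts/fit_style_matchup.py.

```python
import math
from statistics import NormalDist

SIGMA = 0.05  # prior scale for each pair effect, as stated in the note


def apply_style_offsets(lam_home, lam_away, delta_home, delta_away):
    """Shift each side's baseline Poisson rate by its log-rate offset."""
    return lam_home * math.exp(delta_home), lam_away * math.exp(delta_away)


# Prior mass inside +/- 2 sigma, i.e. +/- 0.10 on the log-rate.
prior = NormalDist(mu=0.0, sigma=SIGMA)
mass_within_2_sigma = prior.cdf(2 * SIGMA) - prior.cdf(-2 * SIGMA)

# Hypothetical baseline rates and offsets, for illustration only.
lam_h, lam_a = apply_style_offsets(1.40, 1.10, 0.05, -0.02)

print(f"prior mass within +/-0.10: {mass_within_2_sigma:.3f}")
print(f"adjusted rates: home={lam_h:.3f}, away={lam_a:.3f}")
```

Because the offset is additive on the log scale, it is multiplicative on the rate: exp(0.10) ≈ 1.105 and exp(−0.10) ≈ 0.905, which is where the "≈ ±10% on the rate" reading comes from.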

The mock artefact (data/wc2026/style_matchup.mock.json) carries hand-authored illustrative pair effects on the published-research scale. The question this note answers is: does a real MAP fit on historical results clear the gates needed to replace the mock?

Backtest setup

Training pool join.

  • Source: data/raw/intl/results.csv (martj42 international_results, 49,330 matches).
  • Filter: date ≥ 2015-01-01 AND both teams in tactical_fingerprint.csv with a canonical style label (i.e. not insufficient-data).
  • 8 of 48 WC2026 teams carry the insufficient-data label (Norway, Iraq, NZ, Curaçao, Haiti, Jordan, BiH, Uzbekistan); matches involving them are excluded.
  • Per-match baseline λ computed via fit_dixon_coles.predict_match against the production DC fit, with composite-α and GK-α offsets OFF (so the style residual is identified against a clean goal-process baseline, not against a baseline that already absorbs other team-level features).
  • Output: 1,036 training rows spanning 2015-01-04 → 2026-03-31, all 64 pair cells populated (6 cells sparse, <5 matches; 18 cells dense, ≥20 matches).
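
The filter step of the join can be sketched as follows. This is an illustration under assumptions, not the code in scripts/build_style_matchup_training.py: the field names `date`, `home_team` and `away_team`, the helper `in_training_pool`, and the example rows and style assignments are all hypothetical.

```python
from datetime import date

TRAINING_SINCE = date(2015, 1, 1)   # date floor stated in the note
INSUFFICIENT = "insufficient-data"  # non-canonical label stated in the note


def in_training_pool(match, style_by_team):
    """True if the match passes the date floor and both sides carry a canonical style."""
    if match["date"] < TRAINING_SINCE:
        return False
    for side in ("home_team", "away_team"):
        style = style_by_team.get(match[side])
        # Teams missing from the fingerprint table, or labelled
        # insufficient-data, exclude the whole match.
        if style is None or style == INSUFFICIENT:
            return False
    return True


# Hypothetical fingerprint table and matches, for illustration only.
style_by_team = {
    "Team A": "balanced",
    "Team B": "high-press",
    "Team C": INSUFFICIENT,
}
matches = [
    {"date": date(2014, 6, 12), "home_team": "Team A", "away_team": "Team B"},
    {"date": date(2022, 6, 6), "home_team": "Team B", "away_team": "Team A"},
    {"date": date(2023, 3, 25), "home_team": "Team C", "away_team": "Team A"},
    {"date": date(2023, 9, 9), "home_team": "Team A", "away_team": "Team D"},
]

pool = [m for m in matches if in_training_pool(m, style_by_team)]
print(len(pool))  # only the 2022 match survives
```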

Walk-forward harness.

  • 8 folds × 90 days, ending 2026-05-24 — same windows as every other gate (§1.4 walk-forward, §2.3 composite-α, §2.6 GK-α).
  • Per walk: refit DC + HP on pre-cutoff matches (full corpus, ~9k each), then build the style-matchup training pool from pre-cutoff matches dated ≥ 2015-01-01 with both sides fingerprinted, then MAP-fit pair effects on the pool, then evaluate the walk's holdout twice — baseline (no offset) and offset-on — for DC, HP, and the uniform DC+HP mean.
  • The training_since=2015-01-01 floor matches the production CSV's filter; without it, the per-walk fit and the production fit would be two different models (per-walk on ~6k matches with shrinkage ≈ 0; production on 1,036 matches with shrinkage ≈ 0.95). Aligning them makes the gate honest.
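
As a sketch of the fold layout, assuming the eight 90-day windows tile contiguously backwards from the end date. The helper `walk_windows` is hypothetical, and how the real harness treats window boundaries (inclusive or exclusive) is not specified here.

```python
from datetime import date, timedelta


def walk_windows(end, n_folds=8, fold_days=90):
    """Return (start, stop) evaluation windows, oldest first, tiling back from `end`."""
    windows = []
    for k in range(n_folds, 0, -1):
        start = end - timedelta(days=k * fold_days)
        windows.append((start, start + timedelta(days=fold_days)))
    return windows


windows = walk_windows(date(2026, 5, 24))
for i, (start, stop) in enumerate(windows, start=1):
    print(i, start, stop)
```

Under that assumption the third window comes out as 2024-11-30 → 2025-02-28, which agrees with the walk 3 eval window reported in the results.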

Two acceptance gates.

  • STRICT (ensemble-on): median Brier on the DC+HP uniform mean strictly lower than the no-offset baseline across walks. Same bar Model 16 / 18 cleared.
  • CONTENT (waterfall-only): on the production fit (1,036 rows), n_train ≥ 800 AND shrinkage_factor < 0.95 AND ≥ 10 distinct cells with |δ_h| > σ/2 or |δ_a| > σ/2. Threshold σ/2 = 0.025 (a 2.5% rate shift).
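
For reference, a sketch of the metric behind the STRICT gate, assuming the Brier score is the multi-class form (squared error summed over H/D/A) and that the uniform mean is a plain equal-weight average of the two models' probabilities. The function names and probabilities are hypothetical.

```python
def brier_3way(probs, outcome):
    """Multi-class Brier score for one match: squared error summed over H/D/A."""
    return sum((p - (1.0 if k == outcome else 0.0)) ** 2 for k, p in probs.items())


def uniform_mean(p_dc, p_hp):
    """Equal-weight average of two models' outcome probabilities."""
    return {k: 0.5 * (p_dc[k] + p_hp[k]) for k in p_dc}


# Hypothetical probabilities for a single match that ended in a home win.
p_dc = {"H": 0.50, "D": 0.30, "A": 0.20}
p_hp = {"H": 0.40, "D": 0.30, "A": 0.30}
p_mean = uniform_mean(p_dc, p_hp)
score = brier_3way(p_mean, "H")

# A uniform 1/3-1/3-1/3 forecast scores 2/3 under this definition.
uninformed = brier_3way({"H": 1 / 3, "D": 1 / 3, "A": 1 / 3}, "D")
print(f"{score:.4f} {uninformed:.4f}")
```

Under this definition an uninformed forecast scores about 0.667, so per-walk values in the 0.50–0.74 range are plausible for this form of the score.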

If STRICT passes, the offset can ship into the ensemble predict path. If only CONTENT passes, the offset ships into the match-page decomposition waterfall only. If neither passes, the negative result is recorded and the mock stays.
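
The two gates and the shipping rule above can be sketched as plain predicates. The function names, return labels and example inputs are hypothetical; the thresholds are the ones stated in this note.

```python
from statistics import median

SIGMA = 0.05


def strict_gate(baseline_briers, offset_briers):
    """Pass only if the offset-on median Brier is strictly lower than baseline."""
    return median(offset_briers) < median(baseline_briers)


def content_gate(n_train, shrinkage_factor, deltas, sigma=SIGMA):
    """deltas maps a cell name to its (delta_home, delta_away) pair."""
    threshold = sigma / 2
    big_cells = sum(
        1 for dh, da in deltas.values()
        if abs(dh) > threshold or abs(da) > threshold
    )
    return n_train >= 800 and shrinkage_factor < 0.95 and big_cells >= 10


def shipping_decision(strict_ok, content_ok):
    """Map the two gate verdicts to where the offset is allowed to ship."""
    if strict_ok:
        return "ensemble"
    if content_ok:
        return "waterfall-only"
    return "keep-mock"


# Hypothetical inputs: three walks and four cells above the sigma/2 threshold.
strict_ok = strict_gate([0.60, 0.70, 0.65], [0.61, 0.70, 0.66])
deltas = {f"cell_{i}": (0.03, 0.0) for i in range(4)}
content_ok = content_gate(n_train=1036, shrinkage_factor=0.953, deltas=deltas)
print(shipping_decision(strict_ok, content_ok))
```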

Result

Per-walk strict-gate detail

Walk  n_train  n_eval  Brier baseline  Brier offset-on  Δ        shrink
1     850      45      0.5865          0.5888           +0.0023  0.958
2     895      33      0.6431          0.6464           +0.0033  0.960
3     928      0       n/a             n/a              n/a      0.960
4     928      13      0.7380          0.7412           +0.0032  0.961
5     941      23      0.6190          0.6194           +0.0004  0.960
6     964      37      0.6571          0.6567           −0.0004  0.956
7     1001     9       0.5007          0.4990           −0.0018  0.953
8     1010     26      0.6609          0.6644           +0.0034  0.953

Walk 3's eval window (2024-11-30 → 2025-02-28) contained zero WC-vs-WC fingerprinted matches: it is the off-season for major leagues, and most internationals in that window are CONCACAF / AFC qualifiers between non-WC teams. The walk is excluded from the aggregate, leaving 7 evaluable walks.

Aggregate verdicts

STRICT gate:
  median Brier baseline  = 0.6431
  median Brier offset-on = 0.6464
  → FAILED (offset is +0.0033 worse on median)

  mean Brier baseline    = 0.6293
  mean Brier offset-on   = 0.6308
  → FAILED on mean too

CONTENT gate:
  n_train          = 1036  (>= 800)        ✓
  shrinkage_factor = 0.953  (< 0.95)       ✗ (just over)
  non-zero cells   = 4      (>= 10)        ✗
  → FAILED
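
The STRICT aggregates can be recomputed directly from the per-walk table. The sketch below hard-codes the Brier values transcribed from that table, with walk 3 omitted because it has no holdout matches; it is a cross-check, not the harness itself.

```python
from statistics import mean, median

# (walk, Brier baseline, Brier offset-on), transcribed from the per-walk table.
walks = [
    (1, 0.5865, 0.5888),
    (2, 0.6431, 0.6464),
    (4, 0.7380, 0.7412),
    (5, 0.6190, 0.6194),
    (6, 0.6571, 0.6567),
    (7, 0.5007, 0.4990),
    (8, 0.6609, 0.6644),
]
baseline = [b for _, b, _ in walks]
offset_on = [o for _, _, o in walks]

strict_pass = median(offset_on) < median(baseline)

print(f"median baseline  = {median(baseline):.4f}")   # 0.6431
print(f"median offset-on = {median(offset_on):.4f}")  # 0.6464
print(f"mean baseline    = {mean(baseline):.4f}")     # 0.6293
print(f"mean offset-on   = {mean(offset_on):.4f}")    # 0.6308
print("STRICT:", "PASSED" if strict_pass else "FAILED")
```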

Top-5 strongest learned cells (production fit)

Pair                             δ_home   δ_away
balanced_vs_transition-heavy     +0.0495  −0.0071
balanced_vs_possession-dominant  −0.0293  −0.0131
counter-attacker_vs_pragmatic    +0.0281  −0.0063
high-press_vs_pragmatic          +0.0233  +0.0211
balanced_vs_balanced             +0.0203  −0.0170

The strongest single cell sits at +0.0495, roughly one prior standard deviation (σ = 0.05) from zero, i.e. even the largest learned effect has not produced credible evidence to overcome the prior pull toward zero. The next four are well below that scale. Most of the remaining 59 cells have |δ| < 0.011.

Why the lift didn't materialise

Three signals point at the same explanation: on this corpus, with these baselines, there is no usable per-style-pair residual left for the model to learn.

  1. Per-walk shrinkage 0.953–0.961. With σ=0.05 and 1k training rows distributed across 64 cells (mean ≈ 16 matches/cell, median ≈ 12, sparse tail of 6 cells with <5 matches), the prior dominates the likelihood. The MAP fit barely moves off zero. This is the expected behaviour for a weakly-informative prior on a small-effect-size signal at n=1k — and matches what published research on style-based effects in football reports.

  2. Strict-gate Brier deltas are tiny and inconsistent. Per-walk deltas range from −0.0018 to +0.0034. Only walks 6 and 7 improve (−0.0004 and −0.0018); the other five evaluable walks worsen by +0.0004 to +0.0034. The mean Brier difference is +0.0015 across the 7 evaluable walks. With 186 total holdout matches across all walks, the standard error on the difference is comparable to or larger than the point estimate. The signal isn't there.

  3. Pre-existing offsets already absorb the team-level signal. The §2.3 composite-α (Model 16, α=0.05) and §2.6 GK-α (Model 18, α=0.05) offsets are already ON in the production ensemble. By design, the style-matchup fit's baseline λ turns those offsets OFF — so the residual being modelled is the bare DC residual. But the production prediction path applies composite-α and GK-α on top of DC's bare λ. When the style-matchup pair effects are then applied at predict time in addition, they are competing with two already-calibrated team-level offsets for the same goal-process residual variance. Empirically, the competition zeros out — no Brier lift.
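
To illustrate the mechanism in point 1, here is a one-cell, one-side sketch of a MAP fit under a Poisson likelihood with the N(0, σ²) prior, solved by Newton's method. It is an illustration under assumptions, not the fitter in scripts/fit_style_matchup.py: the function `map_offset` and the goal counts are hypothetical, and the ratio printed at the end is not claimed to be the harness's shrinkage_factor.

```python
import math


def map_offset(goals, expected, sigma=0.05, iters=50):
    """MAP log-rate offset for one cell.

    Maximises goals*d - expected*exp(d) - d^2 / (2*sigma^2), where `expected`
    is the sum of baseline rates over the cell's matches.
    """
    delta = 0.0
    for _ in range(iters):
        grad = goals - expected * math.exp(delta) - delta / sigma**2
        hess = -expected * math.exp(delta) - 1.0 / sigma**2
        delta -= grad / hess  # Newton step on a concave objective
    return delta


# Hypothetical cell: baseline expects 20 goals, 24 were scored.
goals, expected = 24, 20.0
unpenalised = math.log(goals / expected)  # estimate with no prior
map_est = map_offset(goals, expected)

print(f"unpenalised: {unpenalised:+.4f}  MAP: {map_est:+.4f}")
print(f"fraction retained: {map_est / unpenalised:.3f}")
```

With σ = 0.05 the prior contributes 1/σ² = 400 units of curvature, against roughly 20 from the data in this hypothetical cell, so the estimate stays close to zero even though the raw goal count is 20% above baseline.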

A v1 attempt that jointly re-tunes composite-α and GK-α with the style-matchup offset on is theoretically possible, but the outlook isn't promising: if a small per-fixture style-matchup signal were extracting variance the composite-α offset misses, we would expect at least some per-walk improvement; we see only one walk (walk 7, n=9 holdout) improve by more than 0.0005, and the median walk worsen.

Decision

Do not ship the production artefact. data/wc2026/style_matchup.json is NOT written by this run. The match-page decomposition waterfall continues to consume style_matchup.mock.json (illustrative pair effects on the published-research scale, labelled model_version: "layer2-mock-v0" and n_train: 0). Compliance per CLAUDE.md §12 is unaffected — the mock has shipped since the §2.7 Track A landing and presents the framing the published note recommends.

Do ship the code, the training join, and the gate verdict so the experiment is reproducible and the negative result is on the record. Future-agent path: the corpus roughly doubles after WC2026 — rerunning the same gates against the post-tournament pool is a one-command operation, and a small genuine pair-effect signal may emerge with the larger n. The artefact path remains wired in scripts/build_match_pages.py so if style_matchup.json ever ships, the integration is one file-drop away.

Mirror the published note to web/public/research/notes/style-matchup-fit.md per CLAUDE.md §15.

Implications for the published numbers

None. The production ensemble's P(H/D/A) does not change. The match-page decomposition waterfall continues to render style_matchup.mock.json's illustrative pair effects. The methodology page gains a Model 17 entry documenting the failed gate alongside Model 14 (set-piece-aware DC, also experimental, also failed).

What this backtest can't tell you

  • Whether style-matchup effects are real in the population. This experiment estimates them on 2015–2026 international results between 40 WC2026 teams. Other corpus slices (club, women's, lower-tier internationals) might surface a signal this corpus doesn't.
  • Whether a richer style representation than 8 categorical labels would help. The continuous-bilinear s_h^T M s_a formulation (Track B-cont in the roadmap) was parked behind §4.7 action-embedding style vectors precisely because the categorical 8×8 grid was expected to be the cheaper, faster-to-ship variant. A continuous representation might extract a signal the categorical can't — but it depends on §4.7-grade inputs that we don't yet have.
  • Whether different baselines (e.g. composite-α and GK-α offsets ON at fit-time) would change the result. The version tested here uses bare DC baselines for fit-time identifiability. A simultaneous-fit version is a v1 follow-up.

Files touched

scripts/build_style_matchup_training.py       # new
scripts/backtest_style_matchup.py             # new
tests/test_build_style_matchup_training.py    # new — 11 offline tests
tests/test_backtest_style_matchup.py          # new — 13 offline tests
documentation/research-notes/style-matchup-fit.md   # this note
web/public/research/notes/style-matchup-fit.md      # public mirror
documentation/methodology.md                  # + Model 17 section
.gitignore                                    # + 2 regenerable artefacts

The Track B fitter (scripts/fit_style_matchup.py), the mock-aware loader (scripts/build_match_pages.py:load_style_matchup), and the synthetic-data fit test (tests/test_fit_style_matchup.py) all pre-date this work and are unchanged.

Reproducing

# Build the training pool from results.csv + tactical_fingerprint.csv + DC params.
.venv/bin/python scripts/build_style_matchup_training.py

# Run the walk-forward gate harness (8 folds × 90 days, ~2-3 min wall-clock).
.venv/bin/python scripts/backtest_style_matchup.py

# Inspect the verdict.
.venv/bin/python -c "import json; bt=json.load(open('data/wc2026/style_matchup_backtest.json')); \
    print(json.dumps({'strict': bt['strict_gate'], 'content': bt['content_gate']}, indent=2))"

To rerun after the corpus grows (post-WC2026):

.venv/bin/python scripts/build_style_matchup_training.py
.venv/bin/python scripts/backtest_style_matchup.py --write-production

--write-production writes style_matchup.json only if the CONTENT gate passes; the STRICT gate separately controls the ensemble-on flag (see scripts/ensemble.predict_match if/when wiring lands).