Can international-tournament StatsBomb signals beat the club-derived baseline?

Status: Complete. A1 + B1 were run on the 8-walk / 90-day harness (snapshot 2026-05-28). Verdict: neither ships. B1 fails its gate outright; A1's gate "passes" but on an extremization artefact, not a keeper-skill signal. See the Results section at the end.

Author date: 2026-05-28 (design); results appended 2026-05-28.

Companion code (current): scripts/build_intl_gk_psxg.py, scripts/build_intl_set_piece_conv.py, scripts/build_starting_gk_rating.py, scripts/calibrate_gk_offset.py, scripts/fit_dixon_coles.py, scripts/backtest_models.py.

Companion data (current): data/wc2026/intl_gk_psxg.csv (per-team, 48 rows, 40 with coverage), data/wc2026/intl_set_piece_conv.csv (per-team, 48 rows, 40 with coverage), data/wc2026/starting_gk_rating.csv, data/wc2026/set_piece_xg_share.csv.

Hypothesis

PR #525 + PR #532 produced two new per-team signals extracted from StatsBomb open event data across WC 2018/2022, Euro 2020/2024, Copa America 2024, AFCON 2023:

intl_gk_psxg.csv — per-team shot-stopping signal: on-target pre-shot xG faced vs goals conceded, summed across the 6 covered tournaments.
intl_set_piece_conv.csv — per-team set-piece efficiency: xG generated from corners + attacking-half free kicks, normalised per opportunity.

The existing ensemble carries two channels these could plug into:

GK channel. gk_rating.csv → starting_gk_rating.csv → _load_gk_offset (centred against cohort mean) → ensemble.py applies α × centred rating as a per-fixture defence offset. The current gk_rating.csv is club-football-derived (last three complete Big-5 seasons of PSxG from the JaseZiv keepers parquet, blended with a log(intl_caps) prior). Last gate-pass result: α = 0.05 wins, +1.16 bp Brier vs no-offset baseline, gate_passed: true (data/wc2026/gk_offset_config.json).
Set-piece channel. fit_dixon_coles.py --set-piece-aware consumes set_piece_xg_share.csv (FBref-derived per-team set-piece fraction of total xG, 2018-2023 club football) to add a per-team set_piece_lift offset to the DC log-rate. Backtest-only — production ensemble.py reads baseline dixon_coles.json, not dixon_coles_set_piece.json.

The hypothesis: tournament-mode signals from StatsBomb are different enough from club-form signals that they earn a place in one or both gated channels — either as a replacement, a blend, or a parallel offset.

Why this might gate where style-matchup failed

The style-matchup channel (scripts/fit_style_matchup.py + scripts/backtest_style_matchup.py, derived from the SAME build_team_style_vectors.py event stream) failed both its strict gate and content gate. Two structural reasons to expect different results here:

Per-pair vs per-team. Style-matchup is an interaction effect: it learns pair-specific log-rate offsets from team-A-vs-team-B style mismatches. That requires dense pair-coverage and degrades on sparse 8-walk gates. The two new candidates are per-team scalars — main-effect log-rate offsets that integrate as simple shifts to a team's defensive or set-piece propensity. Per-team additive scalars learn from much less data.
Existing gated channels. Both candidates plug into channels that already gate on the same harness (8 walks × 90d, conjunction of Brier + ECE). The infrastructure is proven; we're swapping the input feature, not inventing a new offset path.

A negative result here is still informative — it would say that for the WC2026 cohort, club PSxG already captures the keeper signal and tournament-mode adds no incremental lift. That's a publishable finding, equivalent in posture to tier_weights_negative_result and style-matchup-fit.md.

Experiment A — GK shot-stopping (intl PSxG vs club PSxG)

Wire-in

build_starting_gk_rating.py reads gk_rating.csv (per-player) and joins to predicted_squads.json to produce starting_gk_rating.csv (per-team starting GK + rating). _load_gk_offset in ensemble.py then loads, centres against cohort mean, and applies α.

The intl signal is per-team aggregate, not per-keeper. Wire-in:

Modify build_starting_gk_rating.py to also read intl_gk_psxg.csv.
Add a new intl_psxg_rating column to starting_gk_rating.csv (or write a sibling table) carrying the z-scored psxg_proxy_per_match per team.
Modify _load_gk_offset (or fork a _load_intl_gk_offset) to consume the chosen column.
Re-run calibrate_gk_offset.py with the new signal.

Bias handling

The goals_prevented_proxy column in intl_gk_psxg.csv has a systematic 2.18× negative bias because pre-shot xG (0.13 per SoT) understates actual goal conversion rate of on-target shots (0.29 per SoT). The consumer (_load_gk_offset) already centres ratings against the cohort mean, so the bias absorbs into the constant — the centred deviations are unbiased ranking signal. No code change for bias correction.
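A minimal sketch of why a constant shift drops out under cohort-mean centring. The helper name `centre` and the sample values are illustrative and are not the body of `_load_gk_offset`; the sketch assumes the bias acts as a roughly constant additive shift on each team's per-match proxy, which is an assumption rather than something the data guarantees.

```python
from statistics import fmean

def centre(ratings):
    """Subtract the cohort mean from every team's rating."""
    mean = fmean(ratings.values())
    return {team: value - mean for team, value in ratings.items()}

# Made-up per-match proxies for three teams (illustrative only).
unbiased = {"A": 0.10, "B": -0.05, "C": 0.25}
# The same ratings with a constant negative shift applied to every team.
biased = {team: value - 0.40 for team, value in unbiased.items()}

centred_unbiased = centre(unbiased)
centred_biased = centre(biased)

# The shift moves the cohort mean by the same amount, so deviations agree.
max_gap = max(abs(centred_unbiased[t] - centred_biased[t]) for t in unbiased)
print(f"largest difference after centring: {max_gap:.2e}")
```

A bias that varies with the volume of shots a team faces would not cancel this way, so the "no code change" conclusion rests on the constant-shift assumption.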

Variants (run cheapest first)

A1 (replace) — for the 40 covered teams, set gk_rating = z-scored intl_psxg_proxy_per_match; for the 8 uncovered teams keep the club PSxG rating. Simplest test of "is tournament-mode signal better than club signal in isolation."
A2 (blend) — sample-weighted blend of the two ratings. Weights from matches_used (intl, 0-26) and n_seasons (club, 0-3) normalised to a [0, 1] confidence each. Teams with strong signal from both get a true blend; teams with strong signal from only one get that one dominantly.
A3 (parallel offset) — keep club PSxG as the primary offset; add intl PSxG as a second offset term with its own α (α₂). Two-axis grid search in calibrate_gk_offset.py. Most expressive but most likely to overfit on the 8-walk gate; only run if A1 or A2 looks promising.
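The A1 and A2 rules can be sketched as below. The function names, the linear confidence mapping, and the neutral fallback of 0.0 are all hypothetical choices for illustration; only the 0-26 and 0-3 ranges come from the text above. The blend also assumes both ratings have already been put on a comparable scale.

```python
MAX_INTL_MATCHES = 26  # upper end of matches_used quoted above
MAX_CLUB_SEASONS = 3   # upper end of n_seasons quoted above

def a1_replace(intl_z, club_rating):
    """A1: use the intl z-score when the team is covered, else the club rating."""
    return club_rating if intl_z is None else intl_z

def a2_blend(intl_z, matches_used, club_rating, n_seasons):
    """A2: confidence-weighted average of the two ratings.

    Each confidence is the sample count scaled linearly into [0, 1]
    (an assumed mapping; the design note does not fix one).
    """
    w_intl = 0.0 if intl_z is None else min(matches_used / MAX_INTL_MATCHES, 1.0)
    w_club = 0.0 if club_rating is None else min(n_seasons / MAX_CLUB_SEASONS, 1.0)
    total = w_intl + w_club
    if total == 0.0:
        return 0.0  # no signal from either source: neutral offset
    return (w_intl * (intl_z or 0.0) + w_club * (club_rating or 0.0)) / total

# Strong signal on both sides gives a true blend.
print(a2_blend(1.0, 26, 0.0, 3))
# No club seasons: the intl rating dominates entirely.
print(a2_blend(1.0, 26, 0.2, 0))
# Uncovered team: falls back to the club rating, as in A1.
print(a2_blend(None, 0, 0.2, 3))
```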

Gate

Existing conjunction (metrics.apply_conjunction_gate):

median Brier strictly lower than baseline (alpha = 0 in the new α grid for that variant), AND
median ECE within +0.2pp of baseline.

8 walks × 90-day evaluation windows (current default). Same harness as the production gk_offset_config.json so results are apples-to-apples.
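For reference, the two conditions can be sketched as a small function. This is not the body of `metrics.apply_conjunction_gate`; the name `conjunction_gate`, the inclusive ECE comparison, and the per-walk numbers are assumptions made up for illustration.

```python
from statistics import median

ECE_TOLERANCE = 0.002  # +0.2 pp expressed as a probability

def conjunction_gate(base_brier, cand_brier, base_ece, cand_ece,
                     ece_tolerance=ECE_TOLERANCE):
    """Return True only if both gate conditions hold across the walks."""
    brier_ok = median(cand_brier) < median(base_brier)  # strictly lower
    ece_ok = median(cand_ece) <= median(base_ece) + ece_tolerance
    return brier_ok and ece_ok

# Illustrative per-walk scores for 8 walks (not real results).
base_brier = [0.500, 0.505, 0.498, 0.510, 0.503, 0.507, 0.495, 0.502]
base_ece = [0.068, 0.072, 0.070, 0.071, 0.069, 0.073, 0.070, 0.071]

better_brier = [b - 0.001 for b in base_brier]
small_ece_drift = [e + 0.001 for e in base_ece]  # inside the tolerance
large_ece_drift = [e + 0.012 for e in base_ece]  # outside the tolerance

print(conjunction_gate(base_brier, better_brier, base_ece, small_ece_drift))
print(conjunction_gate(base_brier, better_brier, base_ece, large_ece_drift))
print(conjunction_gate(base_brier, base_brier, base_ece, base_ece))
```

The third call shows why "strictly lower" matters: a candidate that merely ties the baseline does not pass.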

Baseline for comparison

Two baselines to report for each variant:

alpha = 0 for the same α grid (no offset).
Current production: club-PSxG gk_offset at α = 0.05, gate-passed at +1.16 bp.

The bar is to strictly beat baseline (2). Beating only baseline (1) shows the new signal helps relative to no offset, but not that it improves on the club signal already in production. That is useful to know but not a reason to ship.

Experiment B — Set-piece conversion (intl vs club share)

Wire-in

fit_dixon_coles.py --set-piece-aware reads set_piece_xg_share.csv and uses each team's centred set_piece_xg_share (a fraction) as a feature scaled by SP_FEATURE_SCALE in log-rate units. backtest_models.py --set-piece-aware is the only invocation point in the repo.

Schema translation

set_piece_xg_share.csv has set_piece_xg_share (fraction of total team xG that's set-piece-derived, 2018-2023 club). intl_set_piece_conv.csv has sp_xg, sp_attempts, sp_xg_per_attempt. Not directly substitutable.

Simplest translation: compute sp_xg_per_match = sp_xg / matches_used for the intl table — gives a per-team scalar comparable across the cohort. Modify load_set_piece_shares in fit_dixon_coles.py to accept either input format (sniff by column presence) and centre whichever it loads.
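A sketch of the schema sniff, under stated assumptions: the function name `load_centred_set_piece_feature` and the `team` key column are hypothetical, and this is not the existing `load_set_piece_shares`. It picks the feature by column presence, skips uncovered teams, and centres whatever it loaded.

```python
import csv
import io
from statistics import fmean

def load_centred_set_piece_feature(handle):
    """Read either schema and return {team: centred feature}."""
    rows = list(csv.DictReader(handle))
    if not rows:
        return {}
    columns = set(rows[0].keys())
    if "set_piece_xg_share" in columns:        # club-style table
        raw = {r["team"]: float(r["set_piece_xg_share"]) for r in rows}
    elif {"sp_xg", "matches_used"} <= columns:  # intl-style table
        raw = {r["team"]: float(r["sp_xg"]) / int(r["matches_used"])
               for r in rows if int(r["matches_used"]) > 0}
    else:
        raise ValueError("unrecognised set-piece schema")
    if not raw:
        return {}
    mean = fmean(raw.values())
    return {team: value - mean for team, value in raw.items()}

# Tiny made-up tables, one per schema.
intl_csv = ("team,sp_xg,sp_attempts,sp_xg_per_attempt,matches_used\n"
            "X,6.0,100,0.06,12\nY,2.0,50,0.04,5\nZ,0.0,0,0.0,0\n")
club_csv = "team,set_piece_xg_share\nX,0.30\nY,0.20\n"

intl = load_centred_set_piece_feature(io.StringIO(intl_csv))
club = load_centred_set_piece_feature(io.StringIO(club_csv))
print(intl)
print(club)
```

The two inputs are in different units (xG per match vs a fraction of total xG), so a single `SP_FEATURE_SCALE` may not suit both without re-tuning.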

Variants

B1 (intl-only) — for covered teams use intl sp_xg_per_match; for uncovered teams use club set_piece_xg_share translated to a comparable scale (or just zero / mean).
B2 (blend) — sample-weighted blend by matches_used (intl) and n_player_seasons (club).

Gate

backtest_models.py --set-piece-aware already reports Brier, log-loss, and ECE per model (DC baseline, DC set-piece-aware, HP, ensemble). Gate: DC-set-piece-aware median Brier strictly lower than DC baseline, ECE within +0.2pp. Same walk specification as the GK gate (8 × 90d).

Shipping decision (separate from gating)

Even if B gates, the production ensemble.py reads dixon_coles.json (baseline DC), not dixon_coles_set_piece.json. Shipping requires either:

a separate small PR to switch ensemble.py's DC component to the set-piece-aware fit, OR
a separate small PR to fold the set-piece-aware coefficients into the baseline DC fit as the default.

Either is a one-file change. Out of scope for this experiment — the experiment provides the evidence; the shipping decision can be deferred.

Coverage fallback

8 of 48 WC2026 teams have no StatsBomb coverage (matches_used = 0):

Bosnia and Herzegovina, Curaçao, Haiti, Iraq, Jordan, Norway, New Zealand, Uzbekistan

Reason: none appeared in WC 2018/2022, Euro 2020/2024, Copa 2024, or AFCON 2023. All variants fall back to the existing club-derived rating/share for these teams, matching the graceful behaviour gk_rating.csv already has for keepers without Big-5 history (null → 0 offset post-centring).

Negative-result protocol

If A or B (or any sub-variant) fails its gate:

Persist the backtest JSON next to the existing artefacts:
- data/wc2026/gk_offset_config_intl.json for A
- data/wc2026/dixon_coles_set_piece_intl.json for B
Update this design note with a "Result" section reporting the gate outcome, baseline-vs-variant numbers, and a "what we learned" paragraph.
Do not wire into production. Do not ship.
Cross-link from this note to the existing negative results — style-matchup-fit.md, tier_weights_negative_result memory — so the corpus of "tried, didn't work" experiments stays browsable.

Sequencing & estimated effort

A1 (1-2h) — single-file modification to build_starting_gk_rating.py + re-run calibrate_gk_offset.py. Cleanest, simplest test. Result gates whether A2/A3 are worth trying.
B1 (3-4h) — new sp_xg_per_match column on intl_set_piece_conv.csv (one line) + fit_dixon_coles.py schema-sniff modification + re-run backtest_models.py --set-piece-aware. Parallel to A1, different consumer.
A2 / A3 (1-2h each) — only run if A1 shows lift but doesn't clear the gate. A3 last (most overfit-prone).
B2 (1h) — only run if B1 shows lift but doesn't clear the gate.
Shipping PR(s) (1h each) — only if a variant passes. Separate PR per channel.

Total worst-case if both gate and we ship both: ~12 hours across one or two sessions. If both fail at A1/B1, ~5 hours including the writeup.

Open questions

Cohort-mean centring vs per-confederation centring. Currently the GK offset centres against the global cohort mean. With intl data, AFCON-only teams (Algeria, DRC, Cape Verde, Ivory Coast) have signal from a less-competitive opponent pool than UEFA-only teams. Cohort centring may unfairly penalise / reward across confederations. Worth a sensitivity check if A1 is borderline.
Sample-size shrinkage. Teams with 3-5 matches of coverage (Paraguay, Algeria, Qatar, Cape Verde) have noisier signals than teams with 21-26 matches (Croatia, England, France, Spain). Shrinkage toward the cohort mean by matches_used would reduce noise — could be added inside A2/B2 as a refinement.
Are we double-counting? gk_rating.csv blends club PSxG with log(intl_caps) as a coaching-confidence prior. International caps overlap loosely with international match coverage. Variant A1 (full replace) breaks the caps anchoring entirely; A2 (blend) preserves it. Worth quantifying how much the existing caps prior drives the current rating.
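The shrinkage idea in the sample-size question could be sketched as a simple n / (n + k) weight on the centred rating. The function name and the prior strength `prior_matches = 10` are hypothetical; the note does not propose a specific value, and it would need tuning on the harness.

```python
def shrink_toward_mean(centred_rating, matches_used, prior_matches=10.0):
    """Scale a centred rating by n / (n + k).

    k (prior_matches) is a hypothetical prior strength: the number of
    matches at which a team's own signal and the cohort mean get equal weight.
    """
    if matches_used <= 0:
        return 0.0  # uncovered team: sits at the cohort mean
    weight = matches_used / (matches_used + prior_matches)
    return weight * centred_rating

# A 4-match team keeps about 29 % of its deviation, a 26-match team about 72 %.
print(round(shrink_toward_mean(1.0, 4), 4))
print(round(shrink_toward_mean(1.0, 26), 4))
```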

Results

Run on 2026-05-28, 8 walks × 90-day evaluation windows, the same harness the production gk_offset_config.json uses. Wiring shipped in this PR: build_starting_gk_rating.py --intl-psxg, build_set_piece_share_intl.py, plus --gk-csv (calibrator) and --set-piece-csv / --out (backtest) overrides. The variant CSVs and result JSONs are gitignored (regenerable); the numbers below are the durable record.

B1 — set-piece conversion (intl-only): gate FAILED

backtest_models.py --set-piece-aware --set-piece-csv set_piece_xg_share_intl.csv, aggregate across 8 walks (n = 1,952 common matches):

model	Brier mean	Brier median	ECE (wgt)
Dixon-Coles (baseline)	0.5038	0.5116	5.8 pp
Dixon-Coles (set-piece-aware, intl)	0.5123	0.5089	7.0 pp

The conjunction gate needs median Brier strictly lower and ECE within +0.2 pp. Median Brier did edge down (0.5089 < 0.5116, a 27 bp improvement on the 1 bp = 0.0001 scale used in the A1 table), but ECE blew out by +1.2 pp (5.8 → 7.0) and mean Brier got worse (0.5123 > 0.5038, 85 bp). The calibration degradation fails the gate decisively. Same posture as style-matchup-fit.md: a tournament-mode feature that does not survive the conjunction gate. Do not ship.

A1 — GK shot-stopping (intl PSxG vs club PSxG): gate "passes" but the lift is an extremization artefact, not keeper skill

Shared baseline (no offset, α = 0): Brier median 0.502457, ECE median 0.070991.

signal (scale)	best α (grid)	Brier median	Δ Brier (baseline − variant)	ECE Δ
club PSxG (raw, σ≈0.218)	0.05 (default grid edge)	0.502364	+0.92 bp	−0.23 pp
club PSxG (raw)	0.50 (wide grid edge)	0.501992	+4.64 bp	−0.66 pp
intl PSxG (unit-σ z-score)	0.05 (default grid edge)	0.501184	+12.72 bp	−0.73 pp
intl PSxG (unit-σ)	0.30 (wide grid edge)	0.498040	+44.16 bp	−0.99 pp

Taken at face value the intl signal smashes the gate (+44 bp vs the club channel's shipped +1.16 bp). It is not real, for four converging reasons:

No α turnover. Both signals improve monotonically to the grid edge and never turn over (club still climbing at α = 0.5, intl at α = 0.3). "Best α at the grid edge" is the signature of a global sharpening knob, not a feature finding its natural weight.
The harness scores uncalibrated probabilities. calibrate_gk_offset._evaluate_one_walk_with_fits Briers the raw uniform-average of Elo + DC + HP — no Platt temperature, no extremization (calibrator_hybrid_window). A raw goal-model ensemble is under-dispersed for 1X2, so any wide-spread, team-discriminating multiplicative offset on λ sharpens it and improves Brier. The GK channel is doubling as an extremizer. The production path already extremizes + calibrates, so this gain very likely does not transfer.
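Implausible effect size. The intl signal is unit-σ, so at α = 0.3 a team 1–3σ from the cohort mean receives a log-offset of ±0.3–0.9. To first order that is a ±30–90 % swing in expected goals attributed to the opponent's keeper; exactly, the multiplier on λ runs from about 0.74–1.35× at 1σ to about 0.41–2.46× at 3σ. That is not a shot-stopping effect.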
It isn't even team strength. r(psxg_proxy_per_match, latest Elo) = 0.27 over the 40 covered teams — only ~7 % shared variance. So the channel is neither clean keeper skill nor a clean team-strength proxy; it's whatever residual direction best extremizes the under-dispersed backtest. (Eval-window overlap is a further confound: Euro 2024 + Copa América 2024 — two of the six source tournaments — fall inside the earliest walks, so the static present-day covariate carries information about its own holdout matches.)
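The effect-size point can be checked with a few lines of arithmetic. The sketch assumes the offset enters as exp(α·z) on λ, which matches the "multiplicative offset on λ" description above but is not copied from ensemble.py; the function name is illustrative.

```python
import math

def lambda_multipliers(alpha, z):
    """Multiplicative change in expected goals for a log-offset of ±alpha*z."""
    offset = alpha * z
    return math.exp(-offset), math.exp(offset)

for z in (1.0, 2.0, 3.0):
    down, up = lambda_multipliers(0.3, z)
    print(f"z={z:.0f}: log-offset ±{0.3 * z:.1f} -> x{down:.2f} to x{up:.2f}")

# Shared variance implied by the reported correlation with Elo.
r = 0.27
print(f"r^2 = {r * r:.3f}")
```

At α = 0.3 a team 3σ from the mean would have the opponent's expected goals scaled by roughly 0.41 to 2.46, far beyond any plausible keeper effect, while r² ≈ 0.073 confirms the ~7 % shared variance with Elo.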

Matched-scale comparison. The only apples-to-apples read is at equal effective log-nudge per σ. Club at α = 0.05 nudges ≈0.011 log; intl at α = 0.01 nudges ≈0.010 log and yields +2.7 bp (Brier median 0.502187). So at matched gentle extremization the intl signal beats the club signal by a small margin (~+2.7 vs +0.9 bp), consistent with "marginally better team-defence feature" and nothing like the headline +44 bp.
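The matched-scale arithmetic, spelled out. The helper names are illustrative; the σ values, α values, and Brier medians are the ones reported above.

```python
CLUB_SIGMA = 0.218  # spread of the raw club rating, from the A1 table
INTL_SIGMA = 1.0    # the intl rating is a unit-σ z-score
BASELINE_BRIER = 0.502457  # shared no-offset baseline (α = 0)

def nudge_per_sigma(alpha, sigma):
    """Log-rate shift applied to a team one standard deviation from the mean."""
    return alpha * sigma

def matched_alpha(ref_alpha, ref_sigma, target_sigma):
    """α on the target signal that gives the same per-σ nudge as the reference."""
    return ref_alpha * ref_sigma / target_sigma

def improvement_bp(brier_median):
    """Brier improvement over the baseline in basis points (1 bp = 0.0001)."""
    return (BASELINE_BRIER - brier_median) * 1e4

club_nudge = nudge_per_sigma(0.05, CLUB_SIGMA)
intl_alpha = matched_alpha(0.05, CLUB_SIGMA, INTL_SIGMA)
print(f"club nudge per sigma at alpha=0.05: {club_nudge:.4f}")
print(f"intl alpha for the same nudge:      {intl_alpha:.4f}")
print(f"club at alpha=0.05: {improvement_bp(0.502364):+.2f} bp")
print(f"intl at alpha=0.01: {improvement_bp(0.502187):+.2f} bp")
```

The exactly matched intl α is about 0.011; the reported run at α = 0.01 is the nearest tested value, so the comparison is slightly conservative for the intl signal.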

Verdict: do not ship A1. The intl PSxG signal is at best a small improvement over the club GK rating, and the GK-offset calibration gate as built is not a valid acceptance test for wide-spread multiplicative features — it rewards extremization the production calibrator already performs. A2 / A3 (blend / parallel-offset) were not run: A1 did not produce a trustworthy lift, so the more expressive (more overfit-prone) variants would only inherit the same confound.

What we learned (carry forward)

The gk-offset gate needs to score calibrated probabilities (or compare features at matched effective nudge magnitude). Scoring the raw ensemble conflates feature skill with extremization. This also retroactively weakens confidence in the shipped club gk_offset (+1.16 bp) — though that runs at a gentle α = 0.05 on a narrow-σ signal, so its extremization component is tiny.
Demand an α turnover before trusting a "best α." If Brier is still falling at the widest α tested, widen the grid until it turns over (or regularises), or the optimum is an artefact of the grid boundary.
For any future GK feature: per-walk decomposition to isolate tournament-source / eval-window overlap, and orthogonalise the signal against team Elo + DC defence to isolate keeper-specific skill.
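The α-turnover rule above can be sketched as a check on the grid results. The function name is hypothetical and this is not part of calibrate_gk_offset.py. The first example reuses the intl Brier medians reported in this note (α = 0, 0.01, 0.05, 0.30, which may come from different grid runs); the second curve is made up to show what a turnover looks like.

```python
def best_alpha_with_turnover(alphas, briers):
    """Return (best_alpha, turned_over).

    turned_over is True only when the minimum-Brier alpha lies strictly
    inside the grid, so Brier is no lower on either side of it.
    """
    pairs = sorted(zip(alphas, briers))
    best_index = min(range(len(pairs)), key=lambda i: pairs[i][1])
    interior = 0 < best_index < len(pairs) - 1
    return pairs[best_index][0], interior

# Reported intl PSxG medians: still falling at the widest alpha tested.
intl_result = best_alpha_with_turnover(
    [0.0, 0.01, 0.05, 0.30],
    [0.502457, 0.502187, 0.501184, 0.498040],
)
print(intl_result)

# Made-up curve with a genuine interior optimum.
toy_result = best_alpha_with_turnover(
    [0.0, 0.1, 0.2, 0.3],
    [0.500, 0.490, 0.495, 0.510],
)
print(toy_result)
```

A result with `turned_over` False would trigger the "widen the grid" step rather than being accepted as the best α.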

Both results join the project's "tried, didn't work" corpus alongside style-matchup-fit.md and the tier_weights_negative_result memory. Nothing from this experiment is wired into production.

Regenerate (artefacts are gitignored):

python scripts/build_starting_gk_rating.py --intl-psxg
python scripts/build_set_piece_share_intl.py
python scripts/calibrate_gk_offset.py --gk-csv data/wc2026/starting_gk_rating_intl.csv \
    --folds 8 --window-days 90 --today 2026-05-28 --out data/wc2026/gk_offset_config_intl.json
python scripts/backtest_models.py --set-piece-aware \
    --set-piece-csv data/wc2026/set_piece_xg_share_intl.csv \
    --folds 8 --window-days 90 --today 2026-05-28 \
    --out data/wc2026/backtest_set_piece_intl.json \
    --out-walk data/wc2026/backtest_walk_forward_set_piece_intl.json
