Research note

Can video highlights produce useful match signal? (Early, promising, not shipped)

Status: Display surfaces shipped: match ratings (web/public/video_match_ratings.json), player stats (web/public/video_player_stats.json), team grading (web/public/team_grading.json). Prediction offset (Elo delta from video-derived performance scores) NOT shipped. Corpus is 8 fixtures / 16 team-observations as of 2026-06-14. Needs 30+ fixtures with matched scored-prediction baselines and a clean placebo gate before any production use. The pipeline itself is working and stable.

Pipeline date: 2026-06-12
Code: scripts/find_highlight_urls.py, scripts/build_video_notes.py, scripts/build_video_match_ratings.py, scripts/build_video_player_stats.py, scripts/build_video_team_grading.py, scripts/video_performance_delta.py
Research output (gitignored): data/wc2026/video_performance_deltas.json, documentation/private/video_notes/*.json
Public output: web/public/video_match_ratings.json, web/public/video_player_stats.json, web/public/team_grading.json
Follows: documentation/research-notes/player-form-offset.md (form-as-model-input: negative), documentation/research-notes/composite-coverage-backfill.md (coverage is not the lever)

Problem

The model forecasts match outcomes using pre-match information only: Elo ratings, Dixon-Coles parameters, historical xG, squad composites. Once a tournament starts, the model's view of a team can only update through Elo (which reacts to results) and, in theory, any supplementary signal about how a team is actually performing on the pitch.

Match statistics (possession, xG, shots) are one source of that signal, but they arrive slowly for international tournaments and miss qualitative patterns that experienced watchers notice immediately: a team's defensive shape falling apart, a player running the midfield, a set-piece routine that looks rehearsed. Official highlight packages (8 to 12 minutes per match) compress this information into a watchable format that covers the key moments.

The question: can a multimodal LLM watch official highlight videos and extract structured observations useful for (a) enriching the site's match coverage, and (b) adjusting the model's team-strength estimates between rounds?

Pipeline design

The pipeline has six stages. The first two run per-fixture, manually triggered. The remaining four run in batch after new notes accumulate.

Stage 1: URL discovery (find_highlight_urls.py)

Searches the YouTube Data API for official highlight videos matching each fixture. Scores candidates by source quality (FIFA, FOX Sports, BBC Sport get the highest weight), duration (300 to 900 seconds is the sweet spot for full highlight packages), and keyword signals. Writes up to 3 ranked videos per fixture to data/wc2026/highlight_urls.json.

A negative-keyword filter removes reactions, compilations, predictions, and betting content. Candidates are also dropped outside a hard duration window: the floor is 30 seconds (shorter clips cannot contain useful tactical information) and the ceiling is 3600 seconds (full-match re-uploads are not highlights).
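The filtering and ranking logic can be sketched roughly as follows. This is a minimal illustration, not the code in find_highlight_urls.py: the numeric weights, the keyword list, the Candidate type, and the function names are all assumptions. Only the duration limits and the top-3 cut come from the description above.

```python
from dataclasses import dataclass
from typing import Optional

# All weights and keyword lists are illustrative assumptions,
# not the values used in find_highlight_urls.py.
SOURCE_WEIGHTS = {"FIFA": 3.0, "FOX Sports": 3.0, "BBC Sport": 3.0}
NEGATIVE_KEYWORDS = ("reaction", "compilation", "prediction", "betting")
MIN_SECONDS, MAX_SECONDS = 30, 3600   # hard floor and ceiling from the note
SWEET_SPOT = (300, 900)               # typical full highlight package


@dataclass
class Candidate:
    title: str
    channel: str
    duration_s: int


def score(c: Candidate) -> Optional[float]:
    """Return a ranking score, or None if the candidate is filtered out."""
    title = c.title.lower()
    if any(word in title for word in NEGATIVE_KEYWORDS):
        return None
    if not MIN_SECONDS <= c.duration_s <= MAX_SECONDS:
        return None
    s = SOURCE_WEIGHTS.get(c.channel, 1.0)   # unknown channels get a low weight
    if SWEET_SPOT[0] <= c.duration_s <= SWEET_SPOT[1]:
        s += 2.0
    if "highlights" in title:
        s += 1.0
    return s


def top_candidates(cands: list, k: int = 3) -> list:
    """Keep the k best-scoring candidates that survive the filters."""
    scored = [(score(c), c) for c in cands]
    kept = [(s, c) for s, c in scored if s is not None]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in kept[:k]]
```

Returning None for filtered candidates keeps "rejected" distinct from "scored low", so a reaction video from a high-weight channel can never outrank a genuine highlight package.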

Stage 2: Video observation (build_video_notes.py)

The core extraction step. Sends a public YouTube URL to the Gemini API, which ingests the video natively without any download or storage. The prompt asks for three structured outputs:

  1. Events (array): minute, type (goal / red card / yellow card / penalty / injury / other), one-sentence note.
  2. Player notes (array): name, one-sentence observation grounded in visible moments.
  3. Match notes (array): 2 to 5 sentences on the shape of the match as shown in the footage.

The prompt instructs the model to report only what it can see. If a detail is unclear, it should omit rather than guess.

Forbidden-vocabulary gate. Every string in the response passes through the same has_forbidden_vocab filter used by the match-context editorial system. The gate checks for betting-action words (per CLAUDE.md §3/§4). A single hit discards the entire response. This is deliberate: the error-on-vice design means a false positive costs one retry, while a false negative would leak compliance-violating language into published surfaces.
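A minimal sketch of the all-or-nothing gate, assuming the response is a JSON-like structure of dicts, lists, and strings. The has_forbidden_vocab body here is a stand-in: the real filter and its word list live with the match-context editorial system and are not reproduced. The two terms and the helper names iter_strings and passes_gate are hypothetical.

```python
from typing import Any, Iterator

# Hypothetical word list for illustration only.
FORBIDDEN = ("wager", "parlay")


def has_forbidden_vocab(text: str) -> bool:
    """Stand-in for the project's filter: case-insensitive substring match."""
    lowered = text.lower()
    return any(term in lowered for term in FORBIDDEN)


def iter_strings(node: Any) -> Iterator[str]:
    """Yield every string value nested anywhere in a JSON-like structure."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from iter_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_strings(item)


def passes_gate(response: dict) -> bool:
    """A single hit anywhere rejects the whole response."""
    return not any(has_forbidden_vocab(s) for s in iter_strings(response))
```

Walking the whole structure rather than named fields means a new output field cannot bypass the gate by accident.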

Rights posture. Facts about a match (who scored, who was carded) are not copyrightable. The video footage is. This tool watches and takes notes; it never copies, stores, or re-distributes any frames. Only public YouTube URLs are accepted, and the Gemini API handles playback server-side.

Output goes to documentation/private/video_notes/<fixture>.json (gitignored). Nothing from this stage auto-publishes. A human reviews the notes before any downstream use.

Stage 3: Match ratings (build_video_match_ratings.py)

Reads the private video notes and asks Gemini to score every named player on a 1-to-10 scale:

| Score | Meaning |
| --- | --- |
| 1-3 | Poor: red cards, costly errors, missed clear chances |
| 4-5 | Below average |
| 6 | Average (default for players who appeared without distinction) |
| 7 | Good |
| 8 | Very good: goal, assist, or dominant defensive display |
| 9-10 | Exceptional: multiple goals/assists, controlled the match |

Squad rosters from predicted_squads.json assign each player to a side. The script selects the top 3 rated players per team and writes the results to web/public/video_match_ratings.json for display on recap pages.
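The roster join and top-3 selection might look like the sketch below. The input shapes (a name-to-score mapping and a team-to-players mapping) are assumptions about how the ratings and predicted_squads.json could be represented. The tie-break by name is also an assumption, added so the output is deterministic.

```python
from collections import defaultdict


def top_rated_per_team(ratings, rosters, n=3):
    """ratings: {player_name: score}; rosters: {team_code: [player_name, ...]}."""
    team_of = {player: team for team, players in rosters.items() for player in players}
    by_team = defaultdict(list)
    for player, score in ratings.items():
        team = team_of.get(player)
        if team is None:
            continue  # names that match no roster are skipped in this sketch
        by_team[team].append((player, score))
    # Highest score first; ties broken alphabetically by name.
    return {
        team: sorted(rows, key=lambda r: (-r[1], r[0]))[:n]
        for team, rows in by_team.items()
    }
```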

Stage 4: Player event classification (build_video_player_stats.py)

Uses the LLM to classify each player event into a structured taxonomy: goal, yellow card, red card, shot on target, shot off target, save, defensive block, foul committed, foul won, header, substitution, set piece, other. Deduplicates against the ratings data, then aggregates per-player stats across all analysed matches. Output: web/public/video_player_stats.json.
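The aggregation half of this stage can be sketched as below. The event record fields and the deduplication key (fixture, player, minute, type) are assumptions; the note does not spell out how the script deduplicates against the ratings data. The taxonomy is the one listed above.

```python
from collections import Counter, defaultdict

EVENT_TYPES = {
    "goal", "yellow card", "red card", "shot on target", "shot off target",
    "save", "defensive block", "foul committed", "foul won", "header",
    "substitution", "set piece", "other",
}


def aggregate_player_stats(events):
    """events: iterable of dicts with 'fixture', 'player', 'minute', 'type'."""
    seen = set()
    stats = defaultdict(Counter)
    for e in events:
        # Anything outside the taxonomy is folded into "other".
        kind = e["type"] if e["type"] in EVENT_TYPES else "other"
        key = (e["fixture"], e["player"], e["minute"], kind)
        if key in seen:
            continue  # the same event reported twice counts once
        seen.add(key)
        stats[e["player"]][kind] += 1
    return {player: dict(counts) for player, counts in stats.items()}
```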

Stage 5: Team grading (build_video_team_grading.py)

Grades each team's performance on seven dimensions, each scored 1 to 10:

  1. Passing and build-up (anchored to pre-tournament possession share, build-up passes per attack, tempo)
  2. Pressing and defensive intensity (anchored to PPDA)
  3. Set pieces (anchored to set-piece routine variety)
  4. Chance creation (anchored to final-third entries per 90, shots per 90, crossing rate)
  5. Shot quality and finishing (anchored to xG per shot)
  6. Team cohesion and shape
  7. Individual quality

Each dimension is compared against the team's pre-tournament style profile from team_style_vectors.csv, producing a verdict: "overperformed", "underperformed", or "in line". The script assigns a letter grade (A+ through F) and a form assessment. Player-level grades on five dimensions (passing, defending, movement, technical, impact) accompany the team grades.

Public output: web/public/team_grading.json. Detailed backup: data/wc2026/team_grading_detail.json.
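The verdict step can be illustrated with a small sketch. It assumes the pre-tournament style profile has already been mapped onto the same 1-to-10 scale as the video grade, and it uses a hypothetical tolerance of one point. Neither the mapping nor the threshold is taken from build_video_team_grading.py.

```python
from typing import Optional


def verdict(observed: float, expected: float, tolerance: float = 1.0) -> str:
    """Compare a 1-10 video grade with a 1-10 expectation from the style profile."""
    gap = observed - expected
    if gap > tolerance:
        return "overperformed"
    if gap < -tolerance:
        return "underperformed"
    return "in line"


def team_verdicts(observed: dict, expected: dict, tolerance: float = 1.0) -> dict:
    """Dimensions with no pre-tournament anchor get no verdict (None)."""
    result = {}
    for dim, score in observed.items():
        anchor: Optional[float] = expected.get(dim)
        result[dim] = None if anchor is None else verdict(score, anchor, tolerance)
    return result
```

Dimensions such as team cohesion and individual quality have no statistical anchor in the list above, so the sketch leaves them without a verdict.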

Stage 6: Performance delta (RESEARCH, video_performance_delta.py)

The only stage that touches prediction. Takes the per-player scores from stage 3, computes a team average, subtracts a baseline (6.5 neutral, adjustable by pre-match win probability; the results below use the flat 6.5), and converts the delta to an Elo adjustment via a conservative k-factor of 15. A delta of +1.0 (team scored a full point above baseline) maps to +15 Elo points.

This is the stage that has NOT shipped and carries the explicit header: "RESEARCH — must clear placebo before shipping."
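The stage 6 arithmetic is small enough to write out. This is a sketch of the calculation as described, with the flat 6.5 baseline and k-factor of 15 from the note. The function name and return shape are illustrative, not those of video_performance_delta.py.

```python
from statistics import mean

NEUTRAL_BASELINE = 6.5   # flat baseline from the note
K_FACTOR = 15.0          # Elo points per full point of delta, from the note


def performance_delta(player_scores, baseline=NEUTRAL_BASELINE, k=K_FACTOR):
    """Return (team average, delta vs baseline, Elo adjustment)."""
    if not player_scores:
        raise ValueError("no players scored for this team")
    avg = mean(player_scores)
    delta = avg - baseline
    return avg, delta, k * delta
```

A team with a single scored player rated 8 gives performance_delta([8]) == (8, 1.5, 22.5): one rating alone moves the team by 22.5 Elo points.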

Early results from the delta research

8 fixtures observed, 16 team-level observations as of 2026-06-14. The data shows clear noise problems at this corpus size.

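| Team | Delta | Elo adj | Players scored | Avg score |
| --- | --- | --- | --- | --- |
| rsa | -0.86 | -12.9 | 14 | 5.64 |
| can | -0.75 | -11.2 | 12 | 5.75 |
| sco | -0.50 | -7.5 | 6 | 6.00 |
| tur | -0.50 | -7.5 | 1 | 6.00 |
| bih | -0.37 | -5.5 | 15 | 6.13 |
| mex | -0.30 | -4.5 | 10 | 6.20 |
| cze | -0.06 | -0.9 | 9 | 6.44 |
| kor | -0.05 | -0.8 | 11 | 6.45 |
| usa | +0.17 | +2.6 | 6 | 6.67 |
| hai | +0.50 | +7.5 | 7 | 7.00 |
| par | +0.50 | +7.5 | 3 | 7.00 |
| mar | +1.00 | +15.0 | 2 | 7.50 |
| qat | +1.50 | +22.5 | 1 | 8.00 |
| sui | +1.50 | +22.5 | 1 | 8.00 |
| bra | +1.50 | +22.5 | 1 | 8.00 |
| aus | +1.70 | +25.5 | 5 | 8.20 |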

Mean delta across all observations: +0.31. Standard deviation: 0.88.
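The summary figures can be reproduced from the 16 deltas in the table with the standard library. The reported 0.88 matches the sample standard deviation (n - 1 denominator); the population form is shown for comparison.

```python
from statistics import mean, pstdev, stdev

# Deltas copied from the table, in row order.
deltas = [-0.86, -0.75, -0.50, -0.50, -0.37, -0.30, -0.06, -0.05,
          0.17, 0.50, 0.50, 1.00, 1.50, 1.50, 1.50, 1.70]

print(round(mean(deltas), 2))    # 0.31
print(round(stdev(deltas), 2))   # 0.88, sample SD (n - 1 denominator)
print(round(pstdev(deltas), 2))  # 0.85, population SD
```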

What the numbers show

Player detection is the bottleneck. The "players scored" column ranges from 1 to 15 per team per match. 6 of the 16 observations have 3 or fewer players scored. Every observation with n=1 shows an extreme delta (+1.5 or -0.5), because one player's score IS the team average. The observations with the most players scored (rsa=14, bih=15, can=12) produce moderate, plausible deltas. This is a classic small-sample noise problem, compounded by the LLM's uneven ability to identify players from highlights (name overlays vary by broadcaster, some substitutes are never captioned).
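To put a rough size on the small-sample problem: if individual player scores were independent with a standard deviation of about one point (an assumption for illustration, not a measured value), the standard error of the team average falls as 1/sqrt(n). Multiplying by the stage 6 k-factor gives the implied noise in Elo points.

```python
import math

K_FACTOR = 15.0  # Elo points per point of delta, from stage 6


def elo_noise(n_players: int, player_sd: float = 1.0, k: float = K_FACTOR) -> float:
    """Standard error of the Elo adjustment when averaging n independent scores."""
    if n_players < 1:
        raise ValueError("need at least one scored player")
    return k * player_sd / math.sqrt(n_players)


for n in (1, 3, 8, 15):
    print(n, round(elo_noise(n), 1))
# 1 15.0
# 3 8.7
# 8 5.3
# 15 3.9
```

Under these assumptions a single-player observation carries about four times the noise of a 15-player one, which is consistent with the extreme deltas in the n=1 rows.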

Highlight coverage is partial. A 10-minute highlight reel covers roughly 11% of match time, heavily biased toward goals, cards, and near-misses. A team's defensive shape, midfield control, or pressing intensity may never appear in the highlights. The grading dimensions in stage 5 partially compensate by asking the LLM to assess patterns rather than count events, but the input is still a curated subset of the match.

Direction looks reasonable, magnitude is unreliable. Teams that won upset victories get positive adjustments (aus vs tur: aus at +1.7). Teams that lost to weaker opponents get negative ones (can vs bih: can at -0.75, although bih also shows -0.37, consistent with a scrappy, low-quality match). The direction often aligns with match narrative, but the magnitudes for low-n observations would produce wild Elo swings if shipped.

Why the prediction offset is not shipped

Three gating criteria, all currently failing:

  1. Corpus size. 8 fixtures is far too few for any meaningful evaluation. The player-form offset research note showed that even with a full backtest corpus, a per-match adjustment that feels intuitively reasonable can be inert-to-harmful on Brier. We need at least 30 fixtures with matched scored-prediction baselines before running a placebo gate.

  2. Player detection reliability. Until the "players scored" count is consistently above 8 per team per match, the team-average score is dominated by sampling noise. Possible improvements: (a) feeding the full squad list into the prompt more aggressively, (b) requiring the LLM to score every starter even if not seen in highlights (with an explicit "not visible" flag), (c) supplementing highlights with post-match text reports. None of these are in place yet.

  3. Placebo gate. Both prior attempts to add per-match adjustment signals to the model (player-form offset, within-match game-state features) failed their placebo checks. Any video-derived offset must clear the same standard: randomised-label placebo must show zero effect, and the real-label version must beat it on median Brier without worsening ECE.
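A stripped-down sketch of the randomised-label comparison in criterion 3. It makes several simplifying assumptions that are not part of the project's gate: outcomes are binary (win or not), probabilities come from a plain Elo logistic curve rather than the full model, there is no walk-forward split, and ECE is not checked. All function names are hypothetical.

```python
import random
from statistics import median


def elo_win_prob(elo_diff: float) -> float:
    """Standard Elo logistic curve; a stand-in for the real forecasting model."""
    return 1.0 / (1.0 + 10 ** (-elo_diff / 400.0))


def brier(probs, outcomes) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def placebo_gate(base_diffs, offsets, outcomes, n_shuffles=200, seed=0):
    """Score the real offsets against the same offsets with shuffled labels."""
    rng = random.Random(seed)

    def score(offs):
        probs = [elo_win_prob(d + o) for d, o in zip(base_diffs, offs)]
        return brier(probs, outcomes)

    placebo_scores = []
    for _ in range(n_shuffles):
        shuffled = list(offsets)
        rng.shuffle(shuffled)   # break the link between offset and fixture
        placebo_scores.append(score(shuffled))

    return {
        "baseline": score([0.0] * len(offsets)),
        "real": score(offsets),
        "placebo_median": median(placebo_scores),
    }
```

In this simplified form the offset would pass only if "real" is below "placebo_median" while "placebo_median" stays close to "baseline". A placebo that also beats the baseline suggests the gain comes from something other than the video signal.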

What IS shipped

The non-prediction outputs (stages 3 through 5) are live on the site:

  • Match ratings appear on fixture recap pages, showing each team's highest-rated players from the video analysis.
  • Player event stats aggregate per-player event counts across the tournament.
  • Team grading provides 7-dimension performance grades that compare what teams showed on the pitch to their pre-tournament statistical profiles.

These are display-only features. They enrich match coverage for readers but do not feed back into the model's probability calculations.

What comes next

  1. Grow the corpus. Every match day adds 6 to 8 new fixtures. By the end of the group stage (72 matches), the delta dataset should be large enough to test.
  2. Fix player detection. The immediate priority is getting consistent 11+ player scores per team per match. The prompt already receives the full squad list, but the LLM often only names players featured in highlight captions.
  3. Placebo gate. Once the corpus reaches 30+ fixtures with matched baselines, run the same walk-forward placebo design used in the player-form offset study.
  4. Baseline adjustment. The current flat 6.5 baseline ignores pre-match strength. A win-probability-adjusted baseline (teams expected to win should score higher on average) would reduce the confound between "team played well" and "team was already strong."
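One possible shape for the win-probability-adjusted baseline in item 4 is a linear shift around the neutral 6.5. The linear form and the spread parameter are assumptions for illustration; nothing here has been fitted or tested.

```python
def adjusted_baseline(p_win: float, neutral: float = 6.5, spread: float = 1.0) -> float:
    """Shift the expected team score linearly with pre-match win probability.

    p_win = 0.5 keeps the neutral baseline. spread is a hypothetical parameter:
    the gap in expected score between p_win = 0 and p_win = 1.
    """
    if not 0.0 <= p_win <= 1.0:
        raise ValueError("p_win must be a probability")
    return neutral + spread * (p_win - 0.5)
```

With spread=1.0, a heavy favourite at p_win=1.0 is expected to average 7.0, so a team average of 7.0 yields a delta of zero rather than +0.5.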