How we make predictions

What we predict and how

For every prediction target — match outcomes, goal totals, scorelines, individual player events — there's a standard modelling approach and a set of input variables. This page catalogues all of them in one place.

Cross-refs:

  • team-modeling.md — team-strength model families (Dixon-Coles, Elo, Bayesian, ML).
  • player-quality.md and player-expected-value.md — player ratings and action-value models.
  • contextual-factors.md — home advantage, fatigue, weather, referees, manager effects.
  • data-sources.md and data-sources-matrix.md — where each variable comes from.

The pattern across all targets is the same: a probabilistic model produces P(target) from a feature set, and the model is evaluated by its calibration and resolution on held-out data. What differs target-to-target is the response distribution and the salient features.


1. Match outcome — 1X2 (home / draw / away)

What's modelled. The discrete distribution P(home_win), P(draw), P(away_win) over 90 minutes plus stoppage time (extra time excluded, as is conventional for league matches).

Standard approaches.

  • Goal-process: independent or bivariate Poisson with Dixon-Coles low-score correction; integrate the score distribution to get 1X2.
  • Latent ratings: Elo / Glicko / SPI mapped through a calibrated logistic to 1X2.
  • Hierarchical Bayesian Poisson (Baio-Blangiardo lineage).
  • Discriminative ML: XGBoost / LightGBM / random forest / MLP with engineered features.
  • Ensembles: weighted blend of a goal-process model and a discriminative model — common in published 2023–25 work (Hubáček-style stacking).
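
As a minimal sketch of the goal-process route, the block below sums an independent-Poisson score grid into 1X2 probabilities. The scoring rates (1.5 and 1.1) and the function names are illustrative assumptions; it omits the Dixon-Coles correction and any fitted team parameters.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k goals for a Poisson mean lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def outcome_probs(lam_home: float, lam_away: float, max_goals: int = 10):
    """Sum an independent-Poisson score grid into (home, draw, away)."""
    home = draw = away = 0.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = poisson_pmf(h, lam_home) * poisson_pmf(a, lam_away)
            if h > a:
                home += p
            elif h == a:
                draw += p
            else:
                away += p
    total = home + draw + away  # renormalise the truncated grid
    return home / total, draw / total, away / total

# Hypothetical expected goals for the two sides.
p_home, p_draw, p_away = outcome_probs(1.5, 1.1)
```

Truncating at 10 goals per side loses negligible mass at typical football scoring rates; the renormalisation just keeps the three probabilities summing to one.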

Variables typically used.

Team strength and form

  • Latent attack/defence parameters or Elo rating (current and rolling).
  • Rolling xG-for and xG-against over last N matches (5 / 10 / season).
  • Rolling goals-for / goals-against (less informative than xG but still used).
  • Rolling points-per-match.
  • Pi-ratings (Constantinou-Fenton split home/away) — used in several recent benchmarks.

Match context

  • Home / away / neutral.
  • Home-advantage parameter, league- and season-specific.
  • Days since last match for each side (rest).
  • Travel distance / time-zone delta since last match.
  • Midweek European fixture indicator (UCL/UEL hangover).
  • Competition tier (league vs cup vs continental).
  • Stage of season; "dead rubber" indicator near season end.

Lineup and availability

  • Starting-XI strength — sum of player ratings, role-weighted (when XI is known).
  • Indicators for key absentees (top scorer out, first-choice GK out, etc.).
  • Manager identity / manager-change indicator within last N matches.

Common evaluation metrics. Log-loss, ranked probability score (RPS), Brier, accuracy, calibration plots, reliability diagrams.
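
For reference, the three proper scoring rules named above can be computed per match as follows. This is a sketch using the usual definitions, with outcomes ordered (home, draw, away) and `outcome_index` marking the realised result; the function names are our own.

```python
import math

def rps(probs, outcome_index):
    """Ranked probability score for ordered outcomes (lower is better)."""
    cum_p = cum_o = score = 0.0
    for i in range(len(probs) - 1):
        cum_p += probs[i]
        cum_o += 1.0 if i == outcome_index else 0.0  # stays 1 once reached
        score += (cum_p - cum_o) ** 2
    return score / (len(probs) - 1)

def log_loss(probs, outcome_index):
    """Negative log-probability assigned to the realised outcome."""
    return -math.log(probs[outcome_index])

def brier(probs, outcome_index):
    """Multi-class Brier score: squared error against the one-hot outcome."""
    return sum((p - (1.0 if i == outcome_index else 0.0)) ** 2
               for i, p in enumerate(probs))

forecast = (0.5, 0.3, 0.2)      # hypothetical 1X2 forecast
home_win = 0                    # realised outcome index
scores = (rps(forecast, home_win), log_loss(forecast, home_win), brier(forecast, home_win))
```

RPS respects the ordering of outcomes (a home-win forecast is penalised less when the match is drawn than when the away side wins), which is why it is often reported alongside log-loss for 1X2.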


2. Goal-differential distribution

What's modelled. P(home_goals − away_goals = d) for integer d, and derived quantities like P(home wins by ≥ k goals).

Standard approaches.

  • Skellam distribution on home_goals − away_goals, fit directly or derived from a bivariate Poisson.
  • Same goal-process or rating models as 1X2, but read off the goal-difference distribution rather than the marginal outcome.

Variables. Same as 1X2, with extra weight on:

  • Expected goal-difference rather than win probability.
  • xG differential rolling features.
  • Margin-of-victory adjusted Elo (Goals-Adjusted Elo / SPI).
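
A sketch of the goal-difference distribution under independent Poisson goals (i.e. a Skellam distribution), computed here by direct convolution rather than a library call. The rates and the truncation at 25 goals are illustrative assumptions.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    return math.exp(-lam) * lam**k / math.factorial(k)

def goal_diff_pmf(lam_home: float, lam_away: float, d: int, max_goals: int = 25) -> float:
    """P(home_goals - away_goals = d) for independent Poisson goal counts."""
    total = 0.0
    for a in range(max_goals + 1):
        h = a + d
        if h >= 0:
            total += poisson_pmf(h, lam_home) * poisson_pmf(a, lam_away)
    return total

def win_by_at_least(lam_home: float, lam_away: float, k: int, max_goals: int = 25) -> float:
    """P(home wins by k or more goals)."""
    return sum(goal_diff_pmf(lam_home, lam_away, d, max_goals)
               for d in range(k, max_goals + 1))

# Hypothetical rates: probability the home side wins by two or more.
p_home_by_2 = win_by_at_least(1.5, 1.1, 2)
```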

3. Total goals distribution

What's modelled. P(home_goals + away_goals = n) and derived thresholds like P(total ≥ 3).

Standard approaches.

  • Sum of two Poissons (or bivariate Poisson with correlation) → total-goals distribution.
  • Negative binomial on total goals if over-dispersion is detected (lower-tier leagues, friendlies).
  • Discriminative regression on total goals.

Variables.

  • Combined expected goals: λ_home + λ_away.
  • Rolling xG-for + xG-against for both sides.
  • Pace proxies: shots per match, possession, passes per defensive action (PPDA).
  • Style indicator: high-pressing / open vs deep-block / closed (proxied by PPDA, defensive-line height).
  • Referee total-goals tendency (small effect via penalty rate and stoppage time).
  • Weather (rain, wind reduce goals slightly).
  • Pitch surface; altitude (where relevant).
  • 5-substitute rule indicator (post-2020 era → modest goals increase).
  • Stakes / motivation flag.
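
Under the independent-Poisson assumption the total is itself Poisson with mean λ_home + λ_away, so threshold probabilities follow in closed form. A short sketch with illustrative rates (it does not cover the negative-binomial or correlated cases):

```python
import math

def total_goals_pmf(n: int, lam_home: float, lam_away: float) -> float:
    """P(total goals = n) when both sides score as independent Poissons."""
    lam = lam_home + lam_away
    return math.exp(-lam) * lam**n / math.factorial(n)

def prob_total_at_least(k: int, lam_home: float, lam_away: float) -> float:
    """P(total goals >= k), e.g. k=3 for 'over 2.5'."""
    return 1.0 - sum(total_goals_pmf(n, lam_home, lam_away) for n in range(k))

p_over_2_5 = prob_total_at_least(3, 1.5, 1.1)  # hypothetical rates
```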

4. BTTS (Both Teams To Score)

What's modelled. P(home_goals ≥ 1 AND away_goals ≥ 1).

Standard approaches.

  • Direct from the joint goal distribution: under independence, P(BTTS) = (1 − P(home=0)) × (1 − P(away=0)); with correlated goals, use P(BTTS) = 1 − P(home=0) − P(away=0) + P(home=0, away=0).
  • Discriminative classifier with goal-rate features.

Variables.

  • Per-team xG-for and xG-against (both teams must "switch on").
  • Clean-sheet rate / failed-to-score rate per side.
  • Goalkeeper shot-stopping rating (PSxG minus goals conceded): a high-quality GK lowers the probability that the opposing side scores, and therefore lowers BTTS.
  • Defensive shape: low-block teams reduce BTTS; high-line teams raise it.
  • All total-goals features apply in second order.

Correlation note: BTTS is mathematically tied to total-goals and should be modelled consistently with the joint goal distribution.
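
To illustrate the consistency point, the sketch below computes BTTS two ways: the closed form under independence, and a sum over a joint score grid (which also works for correlated grids). The rates and helper names are illustrative assumptions.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    return math.exp(-lam) * lam**k / math.factorial(k)

def btts_independent(lam_home: float, lam_away: float) -> float:
    """P(both score) when the two goal counts are independent Poissons."""
    return (1 - math.exp(-lam_home)) * (1 - math.exp(-lam_away))

def btts_from_grid(grid) -> float:
    """P(both score) from any joint grid, grid[h][a] = P(home=h, away=a)."""
    return sum(p for h, row in enumerate(grid)
               for a, p in enumerate(row) if h >= 1 and a >= 1)

# Build an independent grid and confirm both routes agree.
lam_h, lam_a = 1.5, 1.1
grid = [[poisson_pmf(h, lam_h) * poisson_pmf(a, lam_a) for a in range(16)]
        for h in range(16)]
p_closed = btts_independent(lam_h, lam_a)
p_grid = btts_from_grid(grid)
```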


5. Score-grid distribution

What's modelled. Full joint distribution over (home_goals, away_goals) integer pairs, often truncated; half-time conditional distributions for HT/FT predictions.

Standard approaches.

  • Dixon-Coles bivariate Poisson with low-score correction → score grid.
  • Inhomogeneous Poisson with time-varying rate (rate by 15-minute bucket).
  • For HT/FT prediction, two per-half goal processes with a separate rate for each half, treated as independent or linked through a correlation parameter.

Variables. Same as 1X2 / totals, with:

  • Higher sensitivity to the ρ Dixon-Coles parameter.
  • First-half vs second-half scoring rate split.
  • Late-game scoring profile (5-sub rule pushed late goals up).

The score grid is internally useful for pricing every goal-derived event coherently — once you have the joint, all 1X2, totals, BTTS, and exact-score probabilities follow without re-fitting.
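
A sketch of that idea: build one Dixon-Coles-adjusted grid, then read every goal-derived market off it. The low-score factor follows the standard Dixon-Coles form; the rates and ρ = −0.1 are illustrative values, not fitted parameters.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    return math.exp(-lam) * lam**k / math.factorial(k)

def dc_tau(h: int, a: int, lam: float, mu: float, rho: float) -> float:
    """Dixon-Coles adjustment for the four low-score cells; 1 elsewhere."""
    if h == 0 and a == 0:
        return 1 - lam * mu * rho
    if h == 0 and a == 1:
        return 1 + lam * rho
    if h == 1 and a == 0:
        return 1 + mu * rho
    if h == 1 and a == 1:
        return 1 - rho
    return 1.0

def score_grid(lam: float, mu: float, rho: float, max_goals: int = 10):
    """Joint grid[h][a] for home rate lam and away rate mu, renormalised."""
    grid = [[dc_tau(h, a, lam, mu, rho) * poisson_pmf(h, lam) * poisson_pmf(a, mu)
             for a in range(max_goals + 1)] for h in range(max_goals + 1)]
    total = sum(map(sum, grid))
    return [[p / total for p in row] for row in grid]

def markets(grid):
    """Derive several markets from the same joint distribution."""
    cells = [(h, a, p) for h, row in enumerate(grid) for a, p in enumerate(row)]
    return {
        "home": sum(p for h, a, p in cells if h > a),
        "draw": sum(p for h, a, p in cells if h == a),
        "away": sum(p for h, a, p in cells if h < a),
        "over_2_5": sum(p for h, a, p in cells if h + a >= 3),
        "btts": sum(p for h, a, p in cells if h >= 1 and a >= 1),
        "score_1_1": grid[1][1],
    }

m = markets(score_grid(1.5, 1.1, rho=-0.1))
```

Because every market is a sum over the same cells, the outputs cannot contradict each other, which is the coherence property described above.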


6. First / last / anytime goalscorer

What's modelled. P(player p scores at least one goal) (anytime) or the time-of-first-goal distribution combined with shooter probability (first scorer).

Standard approaches.

  • Multiplicative decomposition: P(team scores) × player_share × minutes_share × penalty-taker premium.
  • Per-player Poisson goal rate, thinned by expected minutes.
  • Empirically: λ = xG_per_90 × E[minutes] / 90, then convert to an anytime probability via 1 − e^{−λ}.

Variables.

  • Player non-penalty xG per 90, rolling and career-stable.
  • Penalty-taker status (binary or probabilistic if shared).
  • Set-piece taker indicators (free kicks, corner taker).
  • Expected minutes (start probability × minutes-given-start; sub-from-bench distribution).
  • Team total xG forecast (the player's share is multiplicative).
  • Shot location distribution (close-range shooters overperform on anytime probability relative to per-90 xG).
  • Opponent defensive strength (xG-against per 90).
  • Match script forecast: does the team chase or hold? Affects late-game shot volume for forwards.
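
The empirical rule above, written out as a minimal sketch. The input values and the optional `opponent_factor` multiplier are illustrative assumptions; penalties and match script are ignored.

```python
import math

def anytime_scorer_prob(xg_per_90: float, expected_minutes: float,
                        opponent_factor: float = 1.0) -> float:
    """P(player scores at least once) under a Poisson goal rate.

    opponent_factor is a hypothetical multiplier (>1 for weak defences).
    """
    lam = xg_per_90 * expected_minutes / 90 * opponent_factor
    return 1.0 - math.exp(-lam)

# Hypothetical forward: 0.45 xG per 90, expected to play 75 minutes.
p_anytime = anytime_scorer_prob(0.45, 75)
```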

7. Player goals (over thresholds)

What's modelled. P(player_goals ≥ k) for thresholds 1 (anytime), 2 (brace), 3 (hat-trick).

Standard approaches.

  • Per-shot Poisson: combine the expected shot count with per-shot xG to obtain the goal-count distribution.
  • Two-stage: model expected shots, then distribute over xG buckets.
  • "Beyond Expected Goals" framework (Mendes-Neves et al., 2025, arXiv:2512.00203): explicit xS (expected shots) × xG per shot, often outperforming flat per-90 baselines.

Variables. Same as anytime scorer, plus:

  • Expected shots conditional on minutes and opponent.
  • Big-chance-conversion rate.
  • Inside-box vs outside-box shot share.
  • Penalty-rate forecast for the team (referee, opponent fouling profile).
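
A simplified sketch of the shots × xG-per-shot idea: if shots are Poisson and each is converted independently with the same probability, goals are Poisson with mean expected_shots × xg_per_shot (Poisson thinning). This is our simplification for illustration, not the method of the cited paper, and the inputs are hypothetical.

```python
import math

def prob_goals_at_least(k: int, expected_shots: float, xg_per_shot: float) -> float:
    """P(player goals >= k) with Poisson shots thinned by per-shot xG."""
    lam = expected_shots * xg_per_shot
    below = sum(math.exp(-lam) * lam**j / math.factorial(j) for j in range(k))
    return 1.0 - below

# Hypothetical striker: 3.0 expected shots at 0.12 xG each.
thresholds = {k: prob_goals_at_least(k, 3.0, 0.12) for k in (1, 2, 3)}
```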

8. Player shots / shots-on-target / shots-in-box

What's modelled. Count distribution (negative binomial typical) on shots / SoT / shots in box; output cumulative P(shots ≥ k).

Standard approaches.

  • Negative binomial per-90 rate × expected minutes, with role and opponent adjustments.
  • Tweedie / zero-inflated Poisson when zero-inflation is severe (rotation players).

Variables.

  • Per-90 shot rate, role-conditional (winger vs CF vs AM).
  • Expected minutes (from lineup model).
  • Opponent shots-conceded rate at the position.
  • Team total shots forecast × player share.
  • Match-script: trailing teams shoot more.
  • Set-piece taker bonus (direct free kicks count).
  • Recent-form shot volume vs season-stable shot volume — published research suggests the form component should be light-weighted; season-stable rates are the stronger signal.

These predictions are noisier for rotation-heavy attackers and creators, where role and minutes uncertainty compound.
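
A sketch of the negative-binomial route to P(shots ≥ k), parameterised by a mean and a dispersion r so that variance = mean + mean²/r. The per-90 rate, minutes, and r = 4 are illustrative assumptions rather than fitted values.

```python
import math

def nb_pmf(k: int, mean: float, r: float) -> float:
    """Negative binomial pmf with given mean and dispersion r."""
    log_p = (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
             + r * math.log(r / (r + mean))
             + k * math.log(mean / (r + mean)))
    return math.exp(log_p)

def prob_at_least(k: int, mean: float, r: float) -> float:
    """P(count >= k)."""
    return 1.0 - sum(nb_pmf(j, mean, r) for j in range(k))

shots_per_90 = 3.2          # hypothetical season-stable rate
expected_minutes = 70       # hypothetical output of a lineup model
mean_shots = shots_per_90 * expected_minutes / 90
p_two_plus = prob_at_least(2, mean_shots, r=4.0)
```

Smaller r means heavier over-dispersion; as r grows the distribution approaches a Poisson with the same mean.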


9. Player assists and chance creation

What's modelled. P(player_assists ≥ 1) and similar.

Standard approaches. Same multiplicative framework as goalscorer: team xG × player share of key passes / xA / set-piece-delivery.

Variables.

  • xA per 90 and key-pass rate.
  • Set-piece-delivery indicator (corner taker, free-kick taker have inflated assist rates).
  • Touches in dangerous zones (final-third entries).
  • Teammate finishing quality (high-xG-overperforming forwards inflate assists for their feeders).
  • Expected minutes; match-script.

xA-style models are noisier than xG, which is why assist predictions carry higher uncertainty than goal predictions.


10. Player cards and fouls

What's modelled. P(player carded), P(player_fouls ≥ k).

Standard approaches. Logistic / Poisson with referee-tendency interaction. Card frequencies are dominated by referee identity more than player identity (see contextual-factors.md).

Variables.

  • Referee identity is the strongest single feature (per-referee yellow rate ranges roughly 3.0–5.5 per match in top European leagues).
  • Player season fouls per 90, by position.
  • Team out-of-possession share forecast (defensive midfielders foul more when their team has less ball).
  • Opponent attacking style (high-press, dribble-heavy attackers draw more fouls).
  • Match stakes / competition tier.
  • Score-state effects (trailing teams foul more late).

Card outcomes are referee-driven and become tractable to model once referee features are included; many simple per-player baselines miss the referee interaction entirely.


11. Team / match cards and bookings index

What's modelled. P(total cards > line) or expected bookings index (e.g., 10·yellows + 25·reds).

Standard approaches. Compound count models: total cards = sum-of-team-cards, with a referee × team-tendency × opponent-tendency interaction.

Variables. Same as player cards plus:

  • Team season cards per match (by team-foul tendency).
  • Referee × team historical interaction (where data exists).
  • Derby / rivalry indicator (real and persistent uplift in card rates).
  • Game state volatility forecast (close games carry more cards in late stages).
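
A rough sketch of a match-level card forecast. It assumes, purely for illustration, that the referee acts as a single multiplier on the sum of the two team rates and that total yellows are Poisson; a fitted model would estimate the interaction and the dispersion. All numbers are hypothetical.

```python
import math

def poisson_cdf(k: int, lam: float) -> float:
    """P(count <= k) for a Poisson mean lam."""
    return sum(math.exp(-lam) * lam**j / math.factorial(j) for j in range(k + 1))

def expected_yellows(home_rate: float, away_rate: float, referee_factor: float) -> float:
    """Expected match yellows under an assumed multiplicative referee effect."""
    return (home_rate + away_rate) * referee_factor

exp_yellows = expected_yellows(1.8, 2.1, referee_factor=1.15)
exp_reds = 0.12                                    # hypothetical red-card rate
p_over_4_5 = 1.0 - poisson_cdf(4, exp_yellows)     # P(total yellows > 4.5)
bookings_index = 10 * exp_yellows + 25 * exp_reds  # expected index value
```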

12. Corners — total and team

What's modelled. P(total_corners > line) and team corner counts.

Standard approaches.

  • Negative binomial or compound Poisson on corner counts (corners arrive in batches around defensive set pieces; over-dispersion is high).
  • Yip's compound Poisson framework (arXiv:2112.13001) is a recent reference — explicitly models the "batch arrival" structure.
  • Or a multiplicative model: possession-share × attacking-third-entry-rate × cross-and-deep-shot-rate × opponent-low-block-rate.

Variables.

  • Per-team corners-for and corners-against per match.
  • Possession share forecast.
  • Attacking style: wide/cross-heavy teams generate more corners than central/through-ball teams.
  • Opponent defensive style (deep-block teams concede more corners).
  • Pace of play (high-tempo games = more corners).
  • Wet pitch / rain (raise corner counts modestly).
  • Match script forecast (trailing team launches more crosses → more corners).
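
To show why batch arrival produces over-dispersion, here is a generic compound Poisson sketch: the number of corner "batches" is Poisson and each batch contains one or more corners. The pmf is computed with the Panjer recursion. The batch-size distribution and rate are invented for illustration and this is not a reproduction of the cited framework.

```python
import math

def compound_poisson_pmf(lam: float, batch_pmf, n_max: int):
    """PMF of total count when N ~ Poisson(lam) batches arrive.

    batch_pmf[j] = P(batch size = j); batch_pmf[0] is assumed to be 0.
    """
    g = [math.exp(-lam)]
    for n in range(1, n_max + 1):
        top = min(n, len(batch_pmf) - 1)
        s = sum(j * batch_pmf[j] * g[n - j] for j in range(1, top + 1))
        g.append(lam / n * s)
    return g

batch_pmf = [0.0, 0.7, 0.2, 0.1]   # hypothetical: 70% singles, 20% pairs, 10% triples
lam = 7.0                          # hypothetical batches per match
pmf = compound_poisson_pmf(lam, batch_pmf, 60)
mean = sum(n * p for n, p in enumerate(pmf))
var = sum((n - mean) ** 2 * p for n, p in enumerate(pmf))
p_over_10_5 = sum(pmf[11:])
```

The mean is λ·E[B] and the variance is λ·E[B²], so the variance exceeds the mean whenever batches can contain more than one corner.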

13. Penalty events

What's modelled. P(penalty awarded) and conditional P(penalty scored).

Standard approaches. Logit on per-team penalties-won rate × opponent foul-in-box rate × referee penalty-award rate. Conversion conditional on shooter is small-sample and high-variance, so most models shrink toward a league baseline (~78%).

Variables.

  • Referee penalty-award rate.
  • Per-team penalties-won / conceded per match.
  • Opponent fouls-in-box rate.
  • VAR-era indicator (post-VAR penalty rates ~20–30% higher).
  • Penalty-taker conversion rate (shrunk to league baseline due to small samples).
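
The shrinkage step can be sketched as a pseudo-count blend toward the league baseline. The 0.78 baseline comes from the text above; the prior strength of 30 pseudo-attempts is an assumption chosen only for illustration.

```python
def shrunk_conversion(scored: int, attempts: int,
                      baseline: float = 0.78, prior_strength: float = 30.0) -> float:
    """Blend a taker's record with the league baseline.

    prior_strength acts like that many extra attempts at the baseline rate.
    """
    return (scored + prior_strength * baseline) / (attempts + prior_strength)

# A taker with 9 from 10 moves only slightly above the baseline.
rate = shrunk_conversion(9, 10)
```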

14. Player passes completed / specific passing actions

What's modelled. Count distribution for passes completed, progressive passes, key passes, etc.

Standard approaches. Linear or negative-binomial regression on per-90 with team possession × position × opponent factors.

Variables.

  • Per-90 pass count, position-conditional.
  • Team possession share forecast (deep-lying playmakers in possession-dominant sides accumulate fast).
  • Opponent press intensity (PPDA) — high-press opponents reduce pass volume.
  • Expected minutes.
  • Match script — trailing teams pass more in some shapes, less in others.

These are heavily team-style dependent and require explicit possession-and-role adjustment to predict accurately.


15. Minutes played / starting XI

What's modelled. P(player starts), together with E[minutes | start] and E[minutes | comes on as a substitute].

Standard approaches.

  • Logistic / decision-tree on rotation features.
  • Bayesian update on lineup-news arrival.

Variables.

  • Days since last match / rotation tendency for the manager.
  • Recent-match minutes (loaded vs rested).
  • Injury status (fit / doubtful / out / suspended).
  • Cup vs league competition.
  • Importance of next match (rotation timing around big games).
  • Distance from international break (post-break starters often rotated).
  • Lineup news from press conferences and team-news feeds.
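
Downstream player models mostly consume a single expected-minutes figure. A minimal sketch of how the pieces combine, with hypothetical probabilities and conditional minutes:

```python
def expected_minutes(p_start: float, minutes_if_start: float,
                     p_sub_given_bench: float, minutes_if_sub: float) -> float:
    """E[minutes] = start branch + bench branch (unused subs contribute 0)."""
    start_part = p_start * minutes_if_start
    bench_part = (1.0 - p_start) * p_sub_given_bench * minutes_if_sub
    return start_part + bench_part

# Hypothetical: 80% to start (80 min on average), else 60% to come on for ~20 min.
e_min = expected_minutes(0.8, 80.0, 0.6, 20.0)
```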

16. In-play / live prediction

What's modelled. Continuously-updated probabilities for goals in remaining time, final score, and goal-differential given current state.

Standard approaches.

  • Inhomogeneous Poisson with state-dependent rate: λ(t | score, red cards, possession).
  • Score-state-conditioned models (Dixon-Coles in-play extensions; Robberechts et al.).
  • Recent (2025): large-scale outcome-forecasting transformers consuming event streams (arXiv:2511.18730) for in-game match/team/player predictions.

Variables. Time-varying:

  • Current score and goal-difference.
  • Time remaining.
  • Expected goals so far (in-play xG, post-shot xG).
  • Red-card events (the second-largest single-event impact after a goal).
  • Score-state-conditional shot rate forecast.
  • Possession-share momentum.
  • Substitution events.
  • Pre-match priors (still load-bearing — in-play models start from the kickoff prior and update).

Documented effects in the literature: in-play models that ignore score-state-conditional shot rates tend to under-react to early red cards and over-react to goals in the immediate post-goal window.
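
A deliberately simple in-play sketch: remaining goals for each side are Poisson with a rate proportional to time left, added to the current score. It uses a constant rate, so it ignores the score-state and time-of-match effects discussed above; any red-card or score-state adjustment would have to be applied to the per-90 rates before calling it. All inputs are hypothetical.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    return math.exp(-lam) * lam**k / math.factorial(k)

def in_play_1x2(score_home: int, score_away: int, minutes_left: float,
                rate_home_per_90: float, rate_away_per_90: float,
                max_extra: int = 10):
    """(home, draw, away) for the final result given the current state."""
    lam_h = rate_home_per_90 * minutes_left / 90
    lam_a = rate_away_per_90 * minutes_left / 90
    home = draw = away = 0.0
    for h in range(max_extra + 1):
        for a in range(max_extra + 1):
            p = poisson_pmf(h, lam_h) * poisson_pmf(a, lam_a)
            diff = (score_home + h) - (score_away + a)
            if diff > 0:
                home += p
            elif diff == 0:
                draw += p
            else:
                away += p
    total = home + draw + away
    return home / total, draw / total, away / total

# Home side leads 1-0 with 30 minutes left.
state_probs = in_play_1x2(1, 0, 30, 1.5, 1.1)
```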


17. End-of-season standings

What's modelled. End-of-season probability distributions: P(team wins league), P(top-4), P(relegated).

Standard approaches. Monte Carlo over remaining fixtures using a per-match win-prob model; aggregate to season-end ranks across (typically) 10k–100k simulations. FiveThirtyEight SPI made this approach famous; ClubElo ratings passed through a logistic model support the same kind of simulation.

Variables.

  • Current points table.
  • Per-match 1X2 model output for every remaining fixture.
  • Team-strength updates as the season progresses (state-space team-strength).
  • Tiebreaker rules (goal difference, head-to-head — coded explicitly).
  • Fixture-difficulty asymmetry across rivals.
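
A toy version of the season simulation, assuming fixed per-fixture 1X2 probabilities. Team names, points, and probabilities are invented; ties on points are broken at random here as a placeholder, whereas a real implementation would code the competition's tiebreakers and update team strength as results are simulated.

```python
import random
from collections import Counter

def simulate_title_odds(points, fixtures, n_sims=10_000, seed=0):
    """Estimate P(team finishes top) by simulating the remaining fixtures.

    points: dict team -> current points
    fixtures: list of (home, away, (p_home, p_draw, p_away))
    """
    rng = random.Random(seed)
    titles = Counter()
    for _ in range(n_sims):
        table = dict(points)
        for home, away, (p_h, p_d, _p_a) in fixtures:
            u = rng.random()
            if u < p_h:
                table[home] += 3
            elif u < p_h + p_d:
                table[home] += 1
                table[away] += 1
            else:
                table[away] += 3
        best = max(table.values())
        leaders = [t for t, pts in table.items() if pts == best]
        titles[rng.choice(leaders)] += 1  # placeholder tiebreak
    return {team: titles[team] / n_sims for team in points}

points = {"A": 70, "B": 68, "C": 60}
fixtures = [("A", "C", (0.55, 0.25, 0.20)), ("B", "A", (0.40, 0.30, 0.30)),
            ("C", "B", (0.30, 0.30, 0.40))]
odds = simulate_title_odds(points, fixtures)
```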

18. Top scorer / most assists / most clean sheets

What's modelled. P(player p ends season as top scorer), etc.

Standard approaches. Monte Carlo over remaining fixtures with a per-player-per-match goal-rate model; aggregate to season-end leaderboard.

Variables.

  • Per-player goal/assist rate per 90.
  • Penalty-taker share (huge effect — penalty takers gain ~5–8 expected goals per season).
  • Expected minutes / start probability for remainder of season.
  • Team total xG forecast and player share.
  • Injury risk (durability).
  • Transfer-window departure risk.

19. Tournament winners (UCL, Euros, World Cup)

What's modelled. P(team wins tournament), conditional on group draw and bracket.

Standard approaches. Monte Carlo through the bracket with a per-match win-prob model; for group stages, simulate the group matches to decide who qualifies, then simulate the resulting bracket. Penalty-shootout uncertainty is added explicitly (typically close to 50/50, with small adjustments for keeper save-rate and penalty conversion).

Variables.

  • All match-outcome features.
  • Bracket structure and seeding.
  • Squad-quality metric (sum-of-XI rating, depth proxy).
  • Manager continuity / experience indicator.
  • Travel and time-zone effects (especially for international tournaments).
  • "Tournament fatigue" — extra-time / penalty-shootout cost on the next match.
  • Neutral-venue vs home-tournament indicator.
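
For a fixed single-elimination bracket, the win probabilities can also be computed exactly rather than simulated. The sketch below assumes an Elo-style logistic for the probability that one side advances (extra time and shootouts folded in); the ratings and team labels are hypothetical.

```python
def make_advance_prob(ratings):
    """Return p_beats(a, b): assumed Elo-logistic chance that a advances past b."""
    def p_beats(a, b):
        return 1.0 / (1.0 + 10 ** (-(ratings[a] - ratings[b]) / 400))
    return p_beats

def bracket_win_probs(teams, p_beats):
    """Exact P(win bracket) for teams listed in bracket order (power of 2)."""
    if len(teams) == 1:
        return {teams[0]: 1.0}
    mid = len(teams) // 2
    left = bracket_win_probs(teams[:mid], p_beats)
    right = bracket_win_probs(teams[mid:], p_beats)
    out = {}
    for a, pa in left.items():
        out[a] = pa * sum(pb * p_beats(a, b) for b, pb in right.items())
    for b, pb in right.items():
        out[b] = pb * sum(pa * p_beats(b, a) for a, pa in left.items())
    return out

ratings = {"A": 1900, "B": 1800, "C": 1750, "D": 1700}
win_probs = bracket_win_probs(["A", "D", "B", "C"], make_advance_prob(ratings))
```

Monte Carlo remains the more flexible choice once group stages, fatigue carry-over, or draw-dependent brackets enter the picture.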

20. Shot-occurrence and pre-shot models (modelling building block)

Not directly a published target, but increasingly important as the upstream of player-level and totals models.

What's modelled. P(shot occurs in this possession) (xS) decomposed alongside xG per shot (Mendes-Neves et al. 2025, "Beyond Expected Goals"). Provides separate signal for buildup quality and finishing quality, fed into downstream models with better calibration than per-90 averages.

Variables.

  • Possession-state features: location, defenders between ball and goal, time in possession, prior pass type.
  • Tracking-data features when available (pitch control, opponent-line height).
  • Player identity in possession (some players generate shots from possessions where most don't).

Cross-cutting variables glossary

A quick reference of the most-used features and where they come from. See data-sources.md and data-sources-matrix.md for full coverage.

Feature | Description | Primary public source
Goals / goal differential | Realised match outcomes | Football-Data.co.uk, FBref
xG / xG-against | Expected goals from shot-quality models | Understat (free), FBref, StatsBomb (paid)
PSxG (post-shot xG) | xG conditional on shot trajectory; goalkeeper metric | Opta, FBref, StatsBomb
xA / xT / VAEP / OBV | Action-value / pass-value frameworks | StatsBomb (event open data), academic open-source for VAEP/xT
Elo / SPI / pi-rating | Latent team-strength scalars | ClubElo, eloratings.net (538/SPI archived)
Possession / PPDA | Possession share, passes per defensive action | FBref, Whoscored
Touches in box / progressive carries | Player progression metrics | FBref
Lineup / starting XI | Confirmed lineups + bench | Sofascore, FotMob, official club feeds; press-conference scraping
Minutes played | Per-player season minutes | FBref, Transfermarkt
Transfermarkt valuation | Crowd-sourced player market value | transfermarkt.com (scraped)
Referee identity + tendencies | Per-ref card, foul, penalty rates | FBref, Whoscored, Sofascore
Weather / pitch / venue | Match-day weather, surface, altitude | OpenWeather API, venue databases

Recent (2024–2026) directions worth flagging

  • xS × xG decomposition for player-level predictions (Mendes-Neves et al., 2025). Better calibration on shots/SoT predictions than per-90 baselines; outperforms logistic baselines in cross-validation.
  • Transformer / sequence models on event data for in-game outcome forecasting (arXiv:2511.18730, late 2025). Large-scale, multi-target (match + team + player); promising but unconfirmed at the top public-baseline bar.
  • Hybrid generalisable models for tournaments that combine player-level and team-level inputs (arXiv:2505.01902, 2025) — useful template for World Cup / Euros where pure team-form data is sparse.
  • Pi-rating + boosting ensembles are now a common 2024–25 baseline, replacing pure-rating or pure-Poisson papers from a decade ago. Reported numbers: RPS ≈ 0.20, accuracy ~52–56% on 1X2.
  • Hybrid frameworks (Frontiers, ITM Conferences 2025) emphasise stacking goal-process, rating, and discriminative models — each captures different residuals, and ensembles materially outperform single-family approaches on log-loss.
  • Generalising across leagues / tournaments: the 2025 literature increasingly tests cross-league generalisation explicitly; cross-league strength factors (a long-standing open question in player-quality.md) are starting to get learned-embedding treatment rather than scalar adjustments.

What this catalogue does NOT settle

  • Which features add information beyond strong public baselines. Most of the variables listed above are also used (implicitly or explicitly) by the strongest published models. The research challenge is finding features and combinations that the public baseline doesn't yet capture — usually fast-moving (lineups, weather, late team news) or under-modelled (referee identity in card models, corner-process specifics).
  • Optimal feature combinations. Published ML-vs-Dixon-Coles head-to-heads disagree across papers; the result depends on league, time period, and feature set.
  • Lineup-aware vs results-only weighting. Open question (see team-modeling.md open list). Public lineup-aware football benchmarks remain scarce.

Key new references (additive to existing notes)

  • Mendes-Neves, T. et al. Beyond Expected Goals: A Probabilistic Framework for Shot Occurrences in Soccer. arXiv:2512.00203, 2025.
  • Yip, S. Forecasting number of corner kicks taken in association football using compound Poisson distribution. arXiv:2112.13001, 2021.
  • Wilkens, S. Can simple models predict football? Lessons from the German Bundesliga. SAGE, 2026.
  • Large-Scale In-Game Outcome Forecasting for Match, Team and Players in Football. arXiv:2511.18730, 2025.
  • From Players to Champions: A Generalizable Machine Learning Approach for Match Outcome Prediction with Insights from the FIFA World Cup. arXiv:2505.01902, 2025.
  • Predicting football match outcomes: a multilayer perceptron neural network model based on technical statistics indicators of the FIFA World Cup. Frontiers in Sports and Active Living, 2025.