Behind the scenes

Where our data comes from

The quality of any prediction depends on the data behind it. This page maps every data source we use — from free public archives to commercial feeds — and explains what each one provides.

Tiers of football data

Roughly, in increasing order of richness, cost, and access friction:

  1. Results — every match, full coverage, free or near-free.
  2. Aggregated stats (shots, possession, xG totals) — public for top leagues.
  3. Event data (per-action rows: passes, dribbles, tackles, shots) — partly free at sample scale, paid at production scale.
  4. Tracking data (25 Hz positions of all 22 players + ball) — mostly paid, broadcast-derived options narrowing the gap.
  5. 360 / freeze-frames (off-ball player positions at the moment of each event) — middle ground, partial access via StatsBomb Open Data.
  6. Lineups, injuries, late team news — distributed across feeds, tightly time-sensitive, the bottleneck for any lineup-aware model.

Results

  • Football-Data.co.uk (Joseph Buchdahl). Free CSV downloads of results for top European leagues going back to the late 1990s. The standard starter dataset for academic and amateur work.
  • FBref (Sports Reference). Free match results and metadata, alongside per-match per-player aggregates. Coverage is broad and historical.
  • Wikipedia / Wikidata. Free, structured results for international tournaments and historical seasons. Good for back-filling pre-2000 data and national-team results.

For a multi-season top-5-league results dataset, Football-Data.co.uk and FBref together cover effectively all needs at zero cost.
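
As a rough illustration of how little code a results file needs, the sketch below parses a few made-up rows laid out like a Football-Data.co.uk CSV. The column names (Date, HomeTeam, AwayTeam, FTHG, FTAG, FTR) follow that site's convention as we understand it, the two date formats handled are an assumption about older versus newer files, and the helper names are hypothetical.

```python
import csv
import io
from collections import Counter
from datetime import datetime

# Illustrative rows only, not real results. Column names are assumed to
# follow the Football-Data.co.uk convention; check the site's notes file.
SAMPLE_CSV = """Div,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR
E0,12/08/2023,Team A,Team B,2,1,H
E0,13/08/2023,Team C,Team D,0,0,D
E0,19/08/23,Team B,Team C,1,3,A
"""


def parse_date(text):
    """Accept dd/mm/yyyy first, then fall back to dd/mm/yy."""
    for fmt in ("%d/%m/%Y", "%d/%m/%y"):
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date: {text!r}")


def load_results(handle):
    """Turn CSV rows into typed dicts, skipping rows without a final score."""
    matches = []
    for row in csv.DictReader(handle):
        if not row.get("FTHG") or not row.get("FTAG"):
            continue  # postponed or malformed row
        matches.append({
            "date": parse_date(row["Date"]),
            "home": row["HomeTeam"],
            "away": row["AwayTeam"],
            "home_goals": int(row["FTHG"]),
            "away_goals": int(row["FTAG"]),
            "result": row["FTR"],  # H, D or A
        })
    return matches


matches = load_results(io.StringIO(SAMPLE_CSV))
outcome_counts = Counter(m["result"] for m in matches)
print(outcome_counts)
```

For a real file, the same load_results function can be given an open file handle instead of the in-memory sample.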

Event data

  • StatsBomb Open Data. Free, well-documented JSON. Includes recent and historical World Cups (men's and women's), Euros, Champions League finals, NWSL seasons, Premier League 2003–04 (Arsenal Invincibles), and a rotating selection. Crucially, includes 360 (freeze-frame) data on a subset. Best free event dataset by quality and documentation.
  • Wyscout open dataset (Pappalardo et al., 2019). Free, archived. Covers Big-5 leagues 2017–18 plus FIFA World Cup 2018 and Euro 2016. Older and limited, but a full season per league is enough for many baselines.
  • Understat (understat.com). Free, scrapeable. Shot-level data with their xG model for the top-5 European leagues and Russian Premier from ~2014 onward. The single easiest source of multi-season xG-level data; community scrapers (e.g. understatapi, understat Python package) abound.
  • FBref (Sports Reference). Free aggregated stats sourced from Opta/StatsPerform, via the soccerdata Python library or direct scraping. Per-match per-player aggregates, not raw events. Coverage broad and historical.
  • Opta / Stats Perform (paid). The dominant commercial event-data provider. Per-feed pricing in the four to six figures depending on coverage.
  • StatsBomb (paid). Higher-quality event tagging than Opta in some categories, plus 360 frames. Per-competition pricing; academic and partnership programmes exist.
  • InStat (acquired by Hudl). Paid, lower-tier-league specialty.

Tracking data

  • SkillCorner. Broadcast-derived (computer-vision on TV feed). Limited to players visible on screen — coverage of off-screen players is interpolated. Cheaper than full-pitch optical tracking; available for many leagues. Partnership and research programmes for academic access.
  • Second Spectrum. Optical, in-stadium cameras. Premier League, La Liga partnerships. Paid; not generally accessible to outside researchers.
  • Stats Perform / ChyronHego TRACAB. Optical, in-stadium. Bundesliga, MLS, others. Paid.
  • Metrica Sports. Provides tracking software and has released anonymised tracking samples (a handful of full matches) — useful for prototyping pitch-control / EPV code, not for modelling at scale.
  • Hawk-Eye Innovations. Used by the Premier League for goal-line technology and VAR, and in other competitions for semi-automated offside; some derived feeds available to clubs.
  • PFF FC (Pro Football Focus, US). Newer entrant, broadcast tracking + event tagging, partnership-based access.

The practical state: for non-club, non-vendor parties, tracking data is the binding constraint. Sample tracking (Metrica, occasional academic releases) is enough to build pipelines; full-season tracking for a top league is generally not affordable without a commercial deal.
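
For a sense of scale, the arithmetic below estimates how many position samples a 25 Hz feed produces. The 90-minute match length (no stoppage time) and the 380-match season (a 20-team double round-robin) are assumptions for illustration.

```python
FRAME_RATE_HZ = 25        # frames per second, as in the tier list
MATCH_MINUTES = 90        # assumption: ignores stoppage time
TRACKED_OBJECTS = 23      # 22 players + ball
MATCHES_PER_SEASON = 380  # assumption: 20-team double round-robin

frames_per_match = FRAME_RATE_HZ * 60 * MATCH_MINUTES
samples_per_match = frames_per_match * TRACKED_OBJECTS
samples_per_season = samples_per_match * MATCHES_PER_SEASON

print(f"frames per match:   {frames_per_match:,}")
print(f"samples per match:  {samples_per_match:,}")
print(f"samples per season: {samples_per_season:,}")
```

A handful of sample matches is therefore already millions of rows, which is enough to exercise a pipeline end to end.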

Player and squad metadata

  • Transfermarkt. Free, scrapeable. The de facto public player-valuation database, covering market value, transfers, contract length, and squad lineage. Lineup pages provide historical XIs and minutes. Crowd-sourced valuations have been shown to correlate strongly with realised transfer fees.
  • FBref. Per-player season aggregates, age, nationality, foot, position. Free.
  • FotMob. Free in-app data including projected lineups, in-game events, top-stat highlights. Less scrape-friendly; community APIs exist.
  • Sofascore. Free, broad lineup and stat coverage. Live updates close to kickoff. Aggressive bot detection.
  • Wikidata / DBpedia. Player biographic linkage, useful for cross-referencing IDs across sources.
  • WhoScored (Opta-derived). Match ratings (0–10), heatmaps, per-player aggregates. Useful as a feature; the rating system is opaque.
  • Capology. Salary and contract data for the major leagues. Useful as a long-horizon prior on squad value.

Context data

  • Weather: OpenWeatherMap, Visual Crossing, Meteostat — free or low-cost historical APIs by lat/long. Match kickoff-time weather is straightforward to backfill.
  • Referee: WhoScored and FBref carry referee assignments and per-referee aggregates (cards, penalties). Some private databases also track per-referee tendencies by team.
  • Stadium / venue: pitch dimensions, altitude, surface (natural / hybrid / artificial). Wikipedia is the most-complete free source.
  • Travel distance: derivable from venue lat/longs. Fixture congestion is likewise derivable from results calendars.
  • Injury and team news: club official sites, PhysioRoom, Premier Injuries, Sky Sports News. Real-time scraping is fragile; commercial feeds (Stats Perform News, Opta News) exist for production use.
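
The two derived features in the list above, travel distance and fixture congestion, can be sketched with the standard library alone. The haversine formula below treats the Earth as a sphere with a mean radius of 6,371 km (an approximation), and the function names and example coordinates are illustrative.

```python
import math
from datetime import date

EARTH_RADIUS_KM = 6371.0  # mean radius; spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def rest_days(match_dates):
    """Days since the previous match for each fixture; None for the first."""
    ordered = sorted(match_dates)
    gaps = [None]
    for previous, current in zip(ordered, ordered[1:]):
        gaps.append((current - previous).days)
    return gaps


# Approximate city-centre coordinates, used here in place of real venues.
london_to_paris = haversine_km(51.5074, -0.1278, 48.8566, 2.3522)
gaps = rest_days([date(2024, 1, 1), date(2024, 1, 4), date(2024, 1, 13)])
print(round(london_to_paris), gaps)
```

Straight-line distance is only a proxy for travel burden, but it is cheap to compute for every fixture once venue coordinates are in place.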

Open data clients and helpers

  • socceraction (KU Leuven) — companion library for VAEP, with ready event-data adapters for StatsBomb, Wyscout, Opta.
  • mplsoccer — Python pitch-plotting and model-prep utilities, well-integrated with StatsBomb open data.
  • soccerdata — Python wrapper for FBref, Sofascore, Understat, ESPN, ClubElo, MatchHistory (Football-Data.co.uk), WhoScored. The single most useful library for stitching free sources together.
  • kloppy — standardises event and tracking data formats across providers (StatsBomb, Wyscout, TRACAB, Sportec) into a common schema.
  • statsbombpy — official StatsBomb open-data Python client.

Data quality and pitfalls

  • Provider-specific xG models: Opta xG ≠ StatsBomb xG ≠ Understat xG. Absolute values are not comparable; ranks within a sample mostly are.
  • Event-data tagging differences: what counts as a "shot blocked," a "key pass," or a "successful tackle" varies by provider. Cross-source feature engineering needs reconciliation.
  • Lineup timing: official starting XIs are released ~60 minutes pre-kickoff. Modelling pre-news vs post-news requires keeping versioned lineup snapshots.
  • Player-ID linkage: Transfermarkt IDs ≠ Opta IDs ≠ FBref IDs. Maintain a master player table with cross-IDs; tools like Wikidata QIDs help.
  • Match-ID linkage: match IDs likewise differ across event and tracking sources. Date + competition + home + away is usually a sufficient join key, provided kickoff dates are timezone-aligned.
  • Backfill bias: revising xG models retroactively and applying them to old shots inflates apparent accuracy. Backtests must use the model that was available at the time.
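
The two linkage points above can be sketched as a small cross-ID table plus a normalised match key. All IDs, field names, and the alias map below are hypothetical. The key assumption is that kickoff times are timezone-aware, so they can be converted to UTC before the date is taken.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical alias map: each source spells team names differently.
TEAM_ALIASES = {"man utd": "manchester united", "man united": "manchester united"}

# Hypothetical master player table with one column per source ID.
PLAYERS = [
    {"player_key": 1, "name": "Example Player",
     "transfermarkt_id": "tm-001", "fbref_id": "fb-aaa", "wikidata_qid": None},
]


def normalise_team(name):
    """Lower-case, trim, and map known aliases to one canonical name."""
    cleaned = name.strip().lower()
    return TEAM_ALIASES.get(cleaned, cleaned)


def match_key(kickoff, competition, home, away):
    """Build a source-independent join key using the UTC kickoff date."""
    if kickoff.tzinfo is None:
        raise ValueError("kickoff must be timezone-aware")
    utc_date = kickoff.astimezone(timezone.utc).date().isoformat()
    return (utc_date, competition.strip().lower(),
            normalise_team(home), normalise_team(away))


def build_index(players, source_field):
    """Map one source's ID to the master player_key, skipping missing IDs."""
    return {p[source_field]: p["player_key"]
            for p in players if p.get(source_field)}


# The same fixture as two sources might report it: local time vs UTC.
local = datetime(2024, 3, 2, 21, 0, tzinfo=timezone(timedelta(hours=-5)))
utc = datetime(2024, 3, 3, 2, 0, tzinfo=timezone.utc)
key_a = match_key(local, "Example Cup", "Man Utd", "Team B")
key_b = match_key(utc, "example cup", "Manchester United", "team b ")
fbref_index = build_index(PLAYERS, "fbref_id")
print(key_a == key_b, fbref_index)
```

Joining on the local calendar date instead would split this fixture across 2 and 3 March, which is the kind of silent mismatch the timezone caveat guards against.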

Cheapest viable starter stack

For a 3-season top-5-league baseline targeting 1X2 and goal-distribution predictions:

  • Football-Data.co.uk for results
  • Understat for shot-level xG
  • FBref (via soccerdata) + Transfermarkt for squad / valuation features
  • StatsBomb Open Data for sample event and 360 data, for tournament-only models
  • OpenWeatherMap historical for weather
  • ClubElo for team-strength priors

Total cost: zero in licence fees, weeks of cleanup engineering. This is the floor that any modelling effort should be able to reach.

Open / unresolved questions

  • Cheapest path to multi-season tracking — is partnering with SkillCorner or PFF FC for academic/personal access realistic at amateur scale?
  • Closest-to-real-time XI feed for free or low cost — official club socials are fastest but unstructured; aggregators lag.
  • Quality of Wyscout vs StatsBomb open data on overlapping competitions — for the few cross-covered tournaments, useful to quantify the disagreement.
  • Whether Understat's xG is good enough as a finishing-skill measure — its model is a logistic baseline; the gap to StatsBomb's xG with body-part and pressure features is non-trivial in some samples.

Key references

Primary data sources mentioned above — direct links and provider notes:

Open-source clients and reconciliation helpers:

  • soccerdata.
  • socceraction (VAEP + provider adapters).
  • kloppy (event-data format normalisation).
  • worldfootballR_data (FBref / Transfermarkt mirror, archived 2025-09 — historical baseline since FBref now Cloudflare-blocks server scraping).

Methodology:

  • Anderson, C. & Sally, D. The Numbers Game. 2013. Accessible primer on football data quality and provider differences.