Where our data comes from

The quality of any prediction depends on the data behind it. This page maps every data source we use — from free public archives to commercial feeds — and explains what each one provides.

Tiers of football data

Roughly in increasing order of richness, cost, and access friction (the last two entries sit outside the strict ordering):

Results: every match, full coverage, free or near-free.
Aggregated stats (shots, possession, xG totals): public for top leagues.
Event data (per-action rows: passes, dribbles, tackles, shots): partly free at sample scale, paid at production scale.
Tracking data (25 Hz positions of all 22 players + ball): mostly paid, though broadcast-derived options are narrowing the gap.
360 / freeze-frames (off-ball player positions at the moment of each event): a middle ground between event and tracking data, with partial access via StatsBomb Open Data.
Lineups, injuries, late team news: distributed across feeds, highly time-sensitive, and the bottleneck for any lineup-aware model.

Results

Football-Data.co.uk (Joseph Buchdahl). Free CSV downloads of results for top European leagues going back to the late 1990s. The standard starter dataset for academic and amateur work.
FBref (Sports Reference). Free aggregated results and match metadata. Per-match per-player aggregates. Coverage broad and historical.
Wikipedia / Wikidata. Free, structured results for international tournaments and historical seasons. Good for back-filling pre-2000 data and national-team results.

For a multi-season top-5-league results dataset, Football-Data.co.uk and FBref together cover effectively all needs at zero cost.
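
As a rough sketch of how such a results file might be read, the snippet below parses a small inline sample using only the standard library. The column names (Date, HomeTeam, AwayTeam, FTHG, FTAG, FTR) are assumed to follow the usual Football-Data.co.uk layout; the sample rows, team names, and the helper names parse_date and parse_results are illustrative, not part of any published client. Dates are assumed to appear with either two-digit or four-digit years depending on the season, so both formats are tried.

```python
import csv
import io
from datetime import date, datetime

# Made-up sample rows in the assumed Football-Data.co.uk column layout.
SAMPLE_CSV = """Div,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR
E0,11/08/2023,Alpha FC,Beta United,0,3,A
E0,12/08/23,Gamma Town,Delta Rovers,2,1,H
"""


def parse_date(text: str) -> date:
    """Accept dd/mm/yyyy first, then fall back to dd/mm/yy."""
    for fmt in ("%d/%m/%Y", "%d/%m/%y"):
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date: {text!r}")


def parse_results(csv_text: str) -> list:
    """Turn CSV text into a list of typed result rows."""
    rows = []
    for raw in csv.DictReader(io.StringIO(csv_text)):
        if not raw.get("Date"):  # skip blank trailing rows
            continue
        rows.append({
            "date": parse_date(raw["Date"]),
            "home": raw["HomeTeam"],
            "away": raw["AwayTeam"],
            "home_goals": int(raw["FTHG"]),
            "away_goals": int(raw["FTAG"]),
            "result": raw["FTR"],  # H, D or A
        })
    return rows


matches = parse_results(SAMPLE_CSV)
```

Normalising dates and goal counts at load time keeps later joins against other sources from depending on per-season formatting quirks.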

Event data

StatsBomb Open Data. Free, well-documented JSON. Includes recent and historical World Cups (men's and women's), Euros, Champions League finals, NWSL seasons, Premier League 2003–04 (Arsenal Invincibles), and a changing selection of other competitions. Crucially, includes 360 (freeze-frame) data on a subset. Best free event dataset by quality and documentation.
Wyscout open dataset (Pappalardo et al., 2019). Free, archived. Covers Big-5 leagues 2017–18 plus FIFA World Cup 2018 and Euro 2016. Older and limited, but a full season per league is enough for many baselines.
Understat (understat.com). Free, scrapeable. Shot-level data with their xG model for the top-5 European leagues and the Russian Premier League from ~2014 onward. The single easiest source of multi-season xG-level data; community scrapers (e.g. the understatapi and understat Python packages) abound.
FBref (Sports Reference). Free aggregated stats sourced from Opta/StatsPerform, via the soccerdata Python library or direct scraping. Per-match per-player aggregates, not raw events. Coverage broad and historical.
Opta / Stats Perform (paid). The dominant commercial event-data provider. Per-feed pricing in the four to six figures depending on coverage.
StatsBomb (paid). Higher-quality event tagging than Opta in some categories, plus 360 frames. Per-competition pricing; academic and partnership programmes exist.
InStat (acquired by Hudl). Paid, lower-tier-league specialty.
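
To make "per-action rows" concrete, here is a minimal sketch that flattens shot events from StatsBomb-style JSON into plain rows and sums xG per team. The nested field names (type.name, team.name, location, shot.statsbomb_xg, shot.outcome.name) are assumed to match the open-data event layout; the sample events are hand-written and the helper names shot_rows and xg_by_team are hypothetical.

```python
import json
from collections import defaultdict

# Hand-written sample shaped like StatsBomb open-data events (values are made up).
SAMPLE_EVENTS = json.loads("""
[
  {"type": {"name": "Pass"}, "team": {"name": "Home FC"}, "location": [60.0, 40.0]},
  {"type": {"name": "Shot"}, "team": {"name": "Home FC"},
   "player": {"name": "Player A"}, "location": [108.0, 38.0],
   "shot": {"statsbomb_xg": 0.31, "outcome": {"name": "Goal"}}},
  {"type": {"name": "Shot"}, "team": {"name": "Away FC"},
   "player": {"name": "Player B"}, "location": [95.0, 50.0],
   "shot": {"statsbomb_xg": 0.04, "outcome": {"name": "Off T"}}}
]
""")


def shot_rows(events):
    """Flatten shot events into plain dict rows; other event types are skipped."""
    rows = []
    for ev in events:
        if ev.get("type", {}).get("name") != "Shot":
            continue
        shot = ev.get("shot", {})
        x, y = ev.get("location", [None, None])[:2]
        rows.append({
            "team": ev.get("team", {}).get("name"),
            "player": ev.get("player", {}).get("name"),
            "x": x,
            "y": y,
            "xg": shot.get("statsbomb_xg"),
            "is_goal": shot.get("outcome", {}).get("name") == "Goal",
        })
    return rows


def xg_by_team(rows):
    """Sum xG per team, ignoring shots with no xG value."""
    totals = defaultdict(float)
    for row in rows:
        if row["xg"] is not None:
            totals[row["team"]] += row["xg"]
    return dict(totals)


rows = shot_rows(SAMPLE_EVENTS)
totals = xg_by_team(rows)
```

The same flattening step is where provider differences would be reconciled, since other vendors nest and name these fields differently.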

Tracking data

SkillCorner. Broadcast-derived (computer vision on the TV feed). Only players visible on screen are observed directly; positions of off-screen players are interpolated. Cheaper than full-pitch optical tracking; available for many leagues. Partnership and research programmes for academic access.
Second Spectrum. Optical, in-stadium cameras. Premier League, La Liga partnerships. Paid; not generally accessible to outside researchers.
Stats Perform / ChyronHego TRACAB. Optical, in-stadium. Bundesliga, MLS, others. Paid.
Metrica Sports. Provides tracking software and has released anonymised tracking samples (a handful of full matches) — useful for prototyping pitch-control / EPV code, not for modelling at scale.
Hawk-Eye Innovations. Now used by the Premier League for VAR and semi-automated offside; some derived feeds available to clubs.
PFF FC (Pro Football Focus, US). Newer entrant, broadcast tracking + event tagging, partnership-based access.

The practical state: for non-club, non-vendor parties, tracking data is the binding constraint. Sample tracking (Metrica, occasional academic releases) is enough to build pipelines; full-season tracking for a top league is generally not affordable without a commercial deal.
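
As an illustration of the kind of pipeline code that sample tracking is enough to prototype, the sketch below derives frame-to-frame speed for one player from 25 Hz positions. It assumes coordinates have already been converted to metres (some samples ship normalised coordinates instead) and uses None for frames where the player was not observed; the function name speeds_m_per_s and the toy track are hypothetical.

```python
import math

FRAME_RATE_HZ = 25  # sampling rate assumed for the tracking feed


def speeds_m_per_s(positions, frame_rate_hz=FRAME_RATE_HZ):
    """Frame-to-frame speed for one player.

    positions: list of (x, y) tuples in metres, one per frame, with None
    marking frames where the player was not observed.
    Returns a list one element shorter than the input; an entry is None
    when either of its two frames is missing.
    """
    dt = 1.0 / frame_rate_hz
    speeds = []
    for prev, cur in zip(positions, positions[1:]):
        if prev is None or cur is None:
            speeds.append(None)
        else:
            speeds.append(math.dist(prev, cur) / dt)
    return speeds


# Toy track: five frames with one missing observation.
track = [(0.0, 0.0), (0.3, 0.4), None, (1.0, 1.0), (1.0, 1.2)]
speeds = speeds_m_per_s(track)
```

Raw frame-to-frame speeds are noisy in practice, so a real pipeline would usually smooth them and cap implausible values before deriving further features.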

Player and squad metadata

Transfermarkt. Free, scrapeable. The de facto public player-valuation database, covering market value, transfers, contract length, and squad lineage. Lineup pages provide historical XIs and minutes. Crowd-sourced valuations have been shown to correlate strongly with realised transfer fees.
FBref. Per-player season aggregates, age, nationality, foot, position. Free.
FotMob. Free in-app data including projected lineups, in-game events, top-stat highlights. Less scrape-friendly; community APIs exist.
Sofascore. Free, broad lineup and stat coverage. Live updates close to kickoff. Aggressive bot detection.
Wikidata / DBpedia. Player biographic linkage, useful for cross-referencing IDs across sources.
WhoScored (Opta-derived). Match ratings (0–10), heatmaps, per-player aggregates. Useful as a feature; the rating system is opaque.
Capology. Salary and contract data for the major leagues. Useful as a long-horizon prior on squad value.

Context data

Weather: OpenWeatherMap, Visual Crossing, Meteostat — free or low-cost historical APIs by lat/long. Match kickoff-time weather is straightforward to backfill.
Referee: WhoScored and FBref carry referee assignments and per-referee aggregates (cards, penalties). Some private databases carry per-referee tendencies by team.
Stadium / venue: pitch dimensions, altitude, surface (natural / hybrid / artificial). Wikipedia is the most complete free source.
Travel distance: derivable from venue lat/longs. Fixture congestion is likewise derivable from results calendars.
Injury and team news: club official sites, PhysioRoom, Premier Injuries, Sky Sports News. Real-time scraping is fragile; commercial feeds (Stats Perform News, Opta News) exist for production use.
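
The two derivable features above, travel distance and fixture congestion, can be sketched in a few lines. The snippet uses the haversine great-circle formula with a mean Earth radius of 6371 km and measures congestion as days since the previous match; the function names and the example dates are hypothetical, and great-circle distance is only a proxy for actual travel.

```python
import math
from datetime import date

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two lat/long points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def rest_days(match_dates):
    """Days since the previous match for each fixture, in date order.

    The first fixture has no predecessor, so its entry is None.
    """
    ordered = sorted(match_dates)
    gaps = [(b - a).days for a, b in zip(ordered, ordered[1:])]
    return [None] + gaps


# Hypothetical fixture list for one team.
fixtures = [date(2024, 1, 13), date(2024, 1, 1), date(2024, 1, 4)]
gaps = rest_days(fixtures)
one_degree_north = haversine_km(0.0, 0.0, 1.0, 0.0)
```

Both functions take plain values, so they can be applied to venue coordinates from any source and to match dates from any results calendar.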

Open data clients and helpers

socceraction (KU Leuven): companion library for VAEP, with ready event-data adapters for StatsBomb, Wyscout, Opta.
mplsoccer: Python pitch-plotting and model-prep utilities, well-integrated with StatsBomb open data.
soccerdata: Python wrapper for FBref, Sofascore, Understat, ESPN, ClubElo, Match History (Football-Data.co.uk), WhoScored. The single most useful library for stitching free sources together.
kloppy: standardises event and tracking data formats across providers (StatsBomb, Wyscout, TRACAB, Sportec) into a common schema.
statsbombpy: official StatsBomb open-data Python client.

Data quality and pitfalls

Provider-specific xG models: Opta xG ≠ StatsBomb xG ≠ Understat xG. Absolute values are not comparable; ranks within a sample mostly are.
Event-data tagging differences: what counts as a "shot blocked," a "key pass," or a "successful tackle" varies by provider. Cross-source feature engineering needs reconciliation.
Lineup timing: official starting XIs are released ~60 minutes pre-kickoff. Modelling pre-news vs post-news requires keeping versioned lineup snapshots.
Player-ID linkage: Transfermarkt IDs ≠ Opta IDs ≠ FBref IDs. Maintain a master player table with cross-IDs; Wikidata QIDs can serve as a neutral join key.
Match-ID linkage: the same problem applies across event and tracking sources. A key of date + competition + home + away is usually sufficient if kickoff times are timezone-aligned.
Backfill bias: revising xG models retroactively and applying them to old shots inflates apparent accuracy. Backtests must use the model that was available at the time.
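
The match-ID linkage point can be sketched as a composite key built from date, competition, and team names. The sketch below assumes each source supplies a timezone-aware kickoff time and converts it to a UTC calendar date before keying; the alias table, the function names, and the example fixture are all hypothetical, and a real alias table has to be curated per source.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical alias table mapping source spellings to one canonical name.
TEAM_ALIASES = {
    "man city": "manchester city",
    "manchester city fc": "manchester city",
}


def canonical_team(name):
    """Lower-case, collapse whitespace, then apply the alias table."""
    key = " ".join(name.lower().split())
    return TEAM_ALIASES.get(key, key)


def match_key(kickoff, competition, home, away):
    """Build a cross-source match key from a timezone-aware kickoff time."""
    if kickoff.tzinfo is None:
        raise ValueError("kickoff must be timezone-aware")
    utc_date = kickoff.astimezone(timezone.utc).date().isoformat()
    return (utc_date, competition.strip().lower(),
            canonical_team(home), canonical_team(away))


# The same hypothetical fixture as reported by two sources:
# source A in local time (UTC-5), source B already in UTC.
local_tz = timezone(timedelta(hours=-5))
key_a = match_key(datetime(2024, 3, 2, 20, 0, tzinfo=local_tz),
                  "Example Cup", "Man City", "Away FC")
key_b = match_key(datetime(2024, 3, 3, 1, 0, tzinfo=timezone.utc),
                  "example cup", "Manchester City", "away  fc")
```

Without the UTC conversion the two sources would disagree on the calendar date (2 March versus 3 March), which is the typical way a naive date-based join silently drops matches.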

Cheapest viable starter stack

For a 3-season top-5-league baseline targeting 1X2 and goal-distribution predictions:

Football-Data.co.uk for results
Understat for shot-level xG
FBref (via soccerdata) + Transfermarkt (separate scraper) for squad / valuation features
StatsBomb Open Data for sample event / 360 data, suited to tournament-only models
OpenWeatherMap historical for weather
ClubElo for team-strength priors

Total cost: zero in licence fees, weeks of cleanup engineering. This is the floor that any modelling effort should be able to reach.
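
A small sketch of the first stitching step in such a stack: aggregating shot-level xG to match totals and attaching them to result rows. The shot fields (match_id, h_a, xG stored as strings) are assumptions modelled on Understat-style scraped records, and match_id here stands in for whatever reconciled match key the pipeline uses; all rows and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical shot rows; field names and string-typed xG are assumptions.
shots = [
    {"match_id": "m1", "h_a": "h", "xG": "0.40"},
    {"match_id": "m1", "h_a": "h", "xG": "0.10"},
    {"match_id": "m1", "h_a": "a", "xG": "0.25"},
]

# Hypothetical result rows already carrying the reconciled match key.
results = [
    {"match_id": "m1", "home": "Home FC", "away": "Away FC",
     "home_goals": 1, "away_goals": 0},
    {"match_id": "m2", "home": "Away FC", "away": "Home FC",
     "home_goals": 2, "away_goals": 2},
]


def xg_totals(shot_rows):
    """Sum xG per match, split into home ('h') and away ('a') sides."""
    totals = defaultdict(lambda: {"h": 0.0, "a": 0.0})
    for shot in shot_rows:
        totals[shot["match_id"]][shot["h_a"]] += float(shot["xG"])
    return totals


def attach_xg(result_rows, shot_rows):
    """Return result rows with home_xg / away_xg; None when no shots are known."""
    totals = xg_totals(shot_rows)
    out = []
    for row in result_rows:
        match_totals = totals.get(row["match_id"])  # .get avoids creating defaults
        out.append({
            **row,
            "home_xg": match_totals["h"] if match_totals else None,
            "away_xg": match_totals["a"] if match_totals else None,
        })
    return out


table = attach_xg(results, shots)
```

Keeping missing xG as None, not zero, makes coverage gaps between sources visible in the feature table instead of hiding them as apparently shot-free matches.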

Open / unresolved questions

Cheapest path to multi-season tracking — is partnering with SkillCorner or PFF FC for academic/personal access realistic at amateur scale?
Closest-to-real-time XI feed for free or low cost — official club socials are fastest but unstructured; aggregators lag.
Quality of Wyscout vs StatsBomb open data on overlapping competitions — for the few cross-covered tournaments, useful to quantify the disagreement.
Is Understat's xG good enough as a finishing-skill measure? Understat describes its model as a neural network trained on a limited set of shot features, and the gap to StatsBomb's xG with body-part and pressure features is non-trivial in some samples.

Key references

Primary data sources mentioned above — direct links and provider notes:

Football-Data.co.uk: https://www.football-data.co.uk/data.php (free results CSVs by league/season).
Understat: https://understat.com/ (shot-level xG; community Python clients e.g. understatapi).
StatsBomb Open Data (event + 360 frames, JSON).
FBref: https://fbref.com/ (Sports Reference; Cloudflare-protected, scrape with soccerdata).
Transfermarkt: https://www.transfermarkt.com/ (squad / market-value scrape, ToS-restricted).
Sofascore: https://www.sofascore.com/ (aggressive bot detection).
WhoScored: https://www.whoscored.com/ (Opta-derived ratings).
ClubElo: http://clubelo.com/ (per-club Elo time-series, free CSVs).
StatsBomb whitepapers and methodology: https://statsbomb.com/articles/category/whitepapers/.

Open-source clients and reconciliation helpers:

soccerdata.
socceraction (VAEP + provider adapters).
kloppy (event and tracking data format normalisation).
worldfootballR_data (FBref / Transfermarkt mirror, archived 2025-09 — historical baseline since FBref now Cloudflare-blocks server scraping).

Methodology:

Anderson, C. & Sally, D. The Numbers Game. 2013. Accessible primer on football data quality and provider differences.