The quality of any prediction depends on the data behind it. This page maps every data source we use — from free public archives to commercial feeds — and explains what each one provides.
Tiers of football data
Roughly, in increasing order of richness, cost, and access friction:
- Results — every match, full coverage, free or near-free.
- Aggregated stats (shots, possession, xG totals) — public for top leagues.
- Event data (per-action rows: passes, dribbles, tackles, shots) — partly free at sample scale, paid at production scale.
- Tracking data (25 Hz positions of all 22 players + ball) — mostly paid, broadcast-derived options narrowing the gap.
- 360 / freeze-frames (off-ball player positions at the moment of each event) — middle ground, partial access via StatsBomb Open Data.
- Lineups, injuries, late team news — distributed across feeds, tightly time-sensitive, the bottleneck for any lineup-aware model.
Results
- Football-Data.co.uk (Joseph Buchdahl). Free CSV downloads of results for top European leagues going back to the late 1990s. The standard starter dataset for academic and amateur work.
- FBref (Sports Reference). Free aggregated results and match metadata. Per-match per-player aggregates. Coverage broad and historical.
- Wikipedia / Wikidata. Free, structured results for international tournaments and historical seasons. Good for back-filling pre-2000 data and national-team results.
For a multi-season top-5-league results dataset, Football-Data.co.uk and FBref together cover effectively all needs at zero cost.
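As a minimal sketch of reading a results file, the snippet below parses rows laid out with the Football-Data.co.uk column names (`Date`, `HomeTeam`, `AwayTeam`, `FTHG`, `FTAG`, `FTR`) using only the standard library. The sample rows and team names are placeholders invented for illustration, and the handling of both four-digit and two-digit years is an assumption about how date formats vary between season files.

```python
import csv
import io
from datetime import date, datetime

# Placeholder rows in the Football-Data.co.uk column layout (values invented).
SAMPLE_CSV = """Div,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR
E0,11/08/2023,Team A,Team B,0,3,A
E0,12/08/23,Team C,Team D,2,1,H
"""


def parse_date(text: str) -> date:
    """Try a four-digit year first, then a two-digit year."""
    for fmt in ("%d/%m/%Y", "%d/%m/%y"):
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date: {text!r}")


def load_results(csv_text: str) -> list[dict]:
    """Convert raw CSV text into typed result rows."""
    rows = []
    for raw in csv.DictReader(io.StringIO(csv_text)):
        rows.append({
            "date": parse_date(raw["Date"]),
            "home": raw["HomeTeam"].strip(),
            "away": raw["AwayTeam"].strip(),
            "home_goals": int(raw["FTHG"]),
            "away_goals": int(raw["FTAG"]),
            "result": raw["FTR"],  # H = home win, D = draw, A = away win
        })
    return rows


results = load_results(SAMPLE_CSV)
```

Typing the goals and dates at load time keeps later joins against other sources from silently comparing strings.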
Event data
- StatsBomb Open Data. Free, well-documented JSON. Includes recent and historical World Cups (men's and women's), Euros, Champions League finals, NWSL seasons, Premier League 2003–04 (Arsenal Invincibles), and a rotating selection. Crucially, includes 360 (freeze-frame) data on a subset. Best free event dataset by quality and documentation.
- Wyscout open dataset (Pappalardo et al., 2019). Free, archived. Covers Big-5 leagues 2017–18 plus FIFA World Cup 2018 and Euro 2016. Older and limited, but a full season per league is enough for many baselines.
- Understat (understat.com). Free, scrapeable. Shot-level data with their xG model for the top-5 European leagues and the Russian Premier League from ~2014 onward. The single easiest source of multi-season xG-level data; community scrapers (e.g. the understatapi and understat Python packages) abound.
- FBref (Sports Reference). Free aggregated stats sourced from Opta / Stats Perform, via the soccerdata Python library or direct scraping. Per-match per-player aggregates, not raw events. Coverage broad and historical.
- Opta / Stats Perform (paid). The dominant commercial event-data provider. Per-feed pricing in the four to six figures depending on coverage.
- StatsBomb (paid). Higher-quality event tagging than Opta in some categories, plus 360 frames. Per-competition pricing; academic and partnership programmes exist.
- InStat (acquired by Hudl). Paid, lower-tier-league specialty.
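To make "per-action rows" concrete, here is a small sketch that aggregates shots, goals, and xG per team from a list of event objects. The nested field names (`type.name`, `team.name`, `shot.statsbomb_xg`, `shot.outcome.name`) are intended to mirror the StatsBomb open-data event layout and should be checked against the current specification; the sample events themselves are invented.

```python
import json
from collections import defaultdict

# Hand-written sample mimicking the nesting of StatsBomb event JSON.
# Values are invented; real files hold thousands of events per match.
SAMPLE_EVENTS = json.loads("""
[
  {"type": {"name": "Pass"}, "team": {"name": "Team A"}},
  {"type": {"name": "Shot"}, "team": {"name": "Team A"},
   "shot": {"statsbomb_xg": 0.25, "outcome": {"name": "Saved"}}},
  {"type": {"name": "Shot"}, "team": {"name": "Team B"},
   "shot": {"statsbomb_xg": 0.5, "outcome": {"name": "Goal"}}},
  {"type": {"name": "Shot"}, "team": {"name": "Team A"},
   "shot": {"statsbomb_xg": 0.125, "outcome": {"name": "Goal"}}}
]
""")


def shot_summary(events):
    """Aggregate shots, goals and xG per team from a list of event dicts."""
    totals = defaultdict(lambda: {"shots": 0, "goals": 0, "xg": 0.0})
    for ev in events:
        if ev["type"]["name"] != "Shot":
            continue  # skip passes, tackles, and every other event type
        shot = ev["shot"]
        row = totals[ev["team"]["name"]]
        row["shots"] += 1
        row["xg"] += shot["statsbomb_xg"]
        if shot["outcome"]["name"] == "Goal":
            row["goals"] += 1
    return dict(totals)


summary = shot_summary(SAMPLE_EVENTS)
```

The same loop shape works for any provider once the field names are remapped, which is the reconciliation work that adapters such as kloppy or socceraction take on.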
Tracking data
- SkillCorner. Broadcast-derived (computer-vision on TV feed). Limited to players visible on screen — coverage of off-screen players is interpolated. Cheaper than full-pitch optical tracking; available for many leagues. Partnership and research programmes for academic access.
- Second Spectrum. Optical, in-stadium cameras. Premier League, MLS partnerships. Paid; not generally accessible to outside researchers.
- Stats Perform / ChyronHego TRACAB. Optical, in-stadium. Bundesliga, La Liga, others. Paid.
- Metrica Sports. Provides tracking software and has released anonymised tracking samples (a handful of full matches) — useful for prototyping pitch-control / EPV code, not for modelling at scale.
- Hawk-Eye Innovations. Now used by the Premier League for VAR and semi-automated offside; some derived feeds available to clubs.
- PFF FC (Pro Football Focus, US). Newer entrant, broadcast tracking + event tagging, partnership-based access.
The practical state: for non-club, non-vendor parties, tracking data is the binding constraint. Sample tracking (Metrica, occasional academic releases) is enough to build pipelines; full-season tracking for a top league is generally not affordable without a commercial deal.
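A typical first step in a tracking pipeline is deriving speed from consecutive positions. The sketch below assumes positions are already in metres and sampled at a fixed 25 Hz; some sample releases store normalised pitch coordinates, which would need rescaling to pitch dimensions first. The three-frame track is a hypothetical player, not real data.

```python
import math

FRAME_RATE_HZ = 25  # rate quoted above; check each provider's metadata


def speeds(positions, frame_rate=FRAME_RATE_HZ):
    """Speed in m/s between consecutive (x, y) positions given in metres."""
    dt = 1.0 / frame_rate
    return [
        math.hypot(x1 - x0, y1 - y0) / dt
        for (x0, y0), (x1, y1) in zip(positions, positions[1:])
    ]


# Hypothetical player: moves 0.2 m, then 0.3 m, between successive frames.
track = [(0.0, 0.0), (0.12, 0.16), (0.30, 0.40)]
player_speeds = speeds(track)  # roughly 5 m/s, then 7.5 m/s

# Volume check: frames in 90 minutes, and position samples for 22 players + ball.
frames_per_match = FRAME_RATE_HZ * 60 * 90
samples_per_match = frames_per_match * 23
```

At 135,000 frames and about 3.1 million position samples per match before stoppage time, a full season is a storage and processing problem as well as a licensing one.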
Player and squad metadata
- Transfermarkt. Free, scrapeable. The de facto public player-valuation database, covering market value, transfers, contract length, and squad lineage. Lineup pages provide historical XIs and minutes. Crowd-sourced valuations have been shown to correlate strongly with realised transfer fees.
- FBref. Per-player season aggregates, age, nationality, foot, position. Free.
- FotMob. Free in-app data including projected lineups, in-game events, top-stat highlights. Less scrape-friendly; community APIs exist.
- Sofascore. Free, broad lineup and stat coverage. Live updates close to kickoff. Aggressive bot detection.
- Wikidata / DBpedia. Player biographic linkage, useful for cross-referencing IDs across sources.
- WhoScored (Opta-derived). Match ratings (0–10), heatmaps, per-player aggregates. Useful as a feature; the rating system is opaque.
- Capology. Salary and contract data for the major leagues. Useful as a long-horizon prior on squad value.
Context data
- Weather: OpenWeatherMap, Visual Crossing, Meteostat — free or low-cost historical APIs by lat/long. Match kickoff-time weather is straightforward to backfill.
- Referee: WhoScored and FBref carry referee assignments and per-referee aggregates (cards, penalties). Some private databases also track per-referee tendencies towards specific teams.
- Stadium / venue: pitch dimensions, altitude, surface (natural / hybrid / artificial). Wikipedia is the most-complete free source.
- Travel distance and fixture congestion: distance is derivable from venue lat/longs; congestion (rest days between matches) from result calendars.
- Injury and team news: club official sites, PhysioRoom, Premier Injuries, Sky Sports News. Real-time scraping is fragile; commercial feeds (Stats Perform News, Opta News) exist for production use.
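Two of the context features above reduce to short calculations. The sketch below computes great-circle travel distance with the haversine formula and rest days between fixtures from a list of match dates. The function names are illustrative, and the mean Earth radius of 6371 km is an approximation that is adequate for travel features.

```python
import math
from datetime import date

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp to 1.0 to guard against rounding error near antipodal points.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def rest_days(fixture_dates):
    """Days since the previous match for each fixture (None for the first)."""
    ordered = sorted(fixture_dates)
    if not ordered:
        return []
    return [None] + [(b - a).days for a, b in zip(ordered, ordered[1:])]


# Hypothetical fixture list for one team.
gaps = rest_days([date(2024, 1, 13), date(2024, 1, 1), date(2024, 1, 4)])
quarter_circle = haversine_km(0.0, 0.0, 0.0, 90.0)
```

Great-circle distance understates real travel (routes, transfers), so it is best treated as a relative feature rather than an absolute one.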
Open data clients and helpers
- socceraction (KU Leuven): companion library for VAEP, with ready event-data adapters for StatsBomb, Wyscout, Opta.
- mplsoccer: Python pitch-plotting and model-prep utilities, well-integrated with StatsBomb open data.
- soccerdata: Python wrapper for FBref, Sofascore, Understat, ESPN, ClubElo, MatchHistory (Football-Data.co.uk), WhoScored. Single most-useful library for stitching free sources together.
- kloppy: standardises event- and tracking-data formats across providers (StatsBomb, Wyscout, TRACAB, Sportec) into a common schema.
- statsbombpy: official StatsBomb open-data Python client.
Data quality and pitfalls
- Provider-specific xG models: Opta xG ≠ StatsBomb xG ≠ Understat xG. Absolute values are not comparable; ranks within a sample mostly are.
- Event-data tagging differences: what counts as a "shot blocked," a "key pass," or a "successful tackle" varies by provider. Cross-source feature engineering needs reconciliation.
- Lineup timing: official starting XIs are typically released 60 to 75 minutes pre-kickoff, depending on the competition. Modelling pre-news vs post-news requires keeping versioned lineup snapshots.
- Player-ID linkage: Transfermarkt IDs ≠ Opta IDs ≠ FBref IDs. Maintain a master player table with cross-IDs; tools like Wikidata QIDs help.
- Match-ID linkage: ditto, across event/tracking sources. Date + competition + home + away usually sufficient if timezone-aligned.
- Backfill bias: revising xG models retroactively and applying them to old shots inflates apparent accuracy. Backtests must use the model that was available at the time.
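The match-ID linkage point can be sketched as a composite join key. The example below normalises a timezone-aware kickoff to a UTC date before combining it with competition and team names, so that a late local kickoff does not land on different calendar days in two sources. The alias table, team names, and competition label are hypothetical placeholders; a real table would be built per source.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical alias table; in practice this grows as sources are added.
TEAM_ALIASES = {
    "team a fc": "team a",
    "team b united": "team b",
}


def canonical_team(name):
    """Lower-case, collapse whitespace, then apply known aliases."""
    key = " ".join(name.lower().split())
    return TEAM_ALIASES.get(key, key)


def match_key(kickoff, competition, home, away):
    """Join key from a timezone-aware kickoff, competition and teams."""
    if kickoff.tzinfo is None:
        raise ValueError("kickoff must be timezone-aware")
    utc_date = kickoff.astimezone(timezone.utc).date().isoformat()
    return (utc_date, competition.lower(),
            canonical_team(home), canonical_team(away))


# Source 1 reports local time at UTC-7; source 2 reports UTC.
local = datetime(2024, 3, 2, 19, 30, tzinfo=timezone(timedelta(hours=-7)))
utc = datetime(2024, 3, 3, 2, 30, tzinfo=timezone.utc)

key_1 = match_key(local, "Example Cup", "Team A FC", "Team B United")
key_2 = match_key(utc, "example cup", "Team A", "team b")
```

Both keys resolve to the same tuple even though the local calendar date is 2 March and the UTC date is 3 March. Sources that publish only a local date, with no kickoff time, cannot be aligned this way and need a per-competition timezone assumption instead.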
Cheapest viable starter stack
For a 3-season top-5-league baseline targeting 1X2 and goal-distribution predictions:
- Football-Data.co.uk for results
- Understat for shot-level xG
- FBref (via soccerdata) + Transfermarkt for squad / valuation features
- StatsBomb Open Data for sample event / 360 data in tournament-only models
- OpenWeatherMap historical for weather
- ClubElo for team-strength priors
Total cost: zero in licence fees, weeks of cleanup engineering. This is the floor that any modelling effort should be able to reach.
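One way to turn Elo ratings into a team-strength prior is the standard Elo expected-score formula, sketched below. This is the generic Elo expectation, not a description of ClubElo's own implementation, and the `home_advantage` parameter is a hypothetical knob with no calibrated default. The expected score blends win and draw probability, so it is an input to a 1X2 model rather than a 1X2 distribution in itself.

```python
def elo_expected_score(home_elo, away_elo, home_advantage=0.0):
    """Standard Elo expected score for the home side, between 0 and 1.

    home_advantage is a hypothetical rating bonus; calibrate it on data.
    """
    diff = (home_elo + home_advantage) - away_elo
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))


even = elo_expected_score(1500, 1500)      # equal ratings
favourite = elo_expected_score(1900, 1500)  # 400-point gap, about 0.909
```

A 400-point gap maps to odds of 10 to 1 in expected score by construction, which makes the output easy to sanity-check before it enters a model.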
Open / unresolved questions
- Cheapest path to multi-season tracking — is partnering with SkillCorner or PFF FC for academic/personal access realistic at amateur scale?
- Closest-to-real-time XI feed for free or low cost — official club socials are fastest but unstructured; aggregators lag.
- Quality of Wyscout vs StatsBomb open data on overlapping competitions — for the few cross-covered tournaments, useful to quantify the disagreement.
- Whether Understat's xG is good enough as a finishing-skill measure — its model is a logistic baseline; the gap to StatsBomb's xG with body-part and pressure features is non-trivial in some samples.
Key references
Primary data sources mentioned above — direct links and provider notes:
- Football-Data.co.uk: https://www.football-data.co.uk/data.php (free results CSVs by league/season).
- Understat: https://understat.com/ (shot-level xG; community Python clients e.g. understatapi).
- StatsBomb Open Data (event + 360 frames, JSON).
- FBref: https://fbref.com/ (Sports Reference; Cloudflare-protected, scrape with soccerdata).
- Transfermarkt: https://www.transfermarkt.com/ (squad / market-value scrape, ToS-restricted).
- Sofascore: https://www.sofascore.com/ (aggressive bot detection).
- WhoScored: https://www.whoscored.com/ (Opta-derived ratings).
- ClubElo: http://clubelo.com/ (per-club Elo time-series, free CSVs).
- StatsBomb whitepapers and methodology: https://statsbomb.com/articles/category/whitepapers/.
Open-source clients and reconciliation helpers:
- soccerdata.
- socceraction (VAEP + provider adapters).
- kloppy (event-data format normalisation).
- worldfootballR_data (FBref / Transfermarkt mirror, archived 2025-09; historical baseline since FBref now Cloudflare-blocks server scraping).
Methodology:
- Anderson, C. & Sally, D. The Numbers Game. 2013. Accessible primer on football data quality and provider differences.