The strongest public model in each category, plus the metric it claims and where the claim comes from. The point is to set a target: if our Phase 3 baseline can't match these, we don't have an edge yet.
Sample sizes, leagues, and time windows differ across these references — direct head-to-head comparisons are not always clean. Where comparison is muddy, the entry says so.