What it takes to beat an animal oracle

In the summer of 2010, the most talked-about forecaster at the World Cup was a common octopus in a tank in Oberhausen, Germany. Paul lived at a Sea Life centre, and ahead of each match he was offered two clear boxes of food, one marked with each team's flag. Whichever box he opened first counted as his call. That summer he correctly called all eight matches he was offered, including the final between Spain and the Netherlands. Two years earlier, at Euro 2008, he had been right in four of his six Germany matches.

People still bring Paul up every tournament. So it is worth being precise about what he actually did, and why it is the wrong thing to envy.

Eight in a row is a coin landing right eight times

Paul's boxes gave him a two-way choice: one team or the other, no draw. Eight correct two-way calls in a row are exactly as likely as a fair coin landing heads eight times running. If you are guessing, the chance of that is one half multiplied by itself eight times, or one in 256.

One in 256 sounds rare until you count how many animals were guessing, in that tournament and every one since. When a crowd of them each take a run at a short streak, the chance that at least one goes perfect stops being one in 256 and climbs toward a coin flip of its own. Paul became the one we remember. He was not the only one playing.
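The arithmetic behind that climb can be sketched in a few lines of Python. This is a minimal illustration, assuming every guesser is independent and makes eight fair two-way calls; the function name and the guesser counts are hypothetical, and nothing here is a count of how many animal oracles actually existed.

```python
def p_at_least_one_perfect(n_guessers: int, streak_len: int = 8) -> float:
    """Chance that at least one of n independent coin-flippers gets a perfect streak."""
    # A single guesser goes perfect with probability (1/2) ** streak_len.
    p_single = 0.5 ** streak_len
    # Everyone failing is (1 - p_single) ** n, so "at least one" is the complement.
    return 1.0 - (1.0 - p_single) ** n_guessers


if __name__ == "__main__":
    # Hypothetical crowd sizes, chosen only to show the trend.
    for n in (1, 10, 50, 100, 178):
        print(f"{n:>4} guessers: {p_at_least_one_perfect(n):.3f}")
```

Under these assumptions a single guesser sits at one in 256, and it takes 178 guessers for a perfect streak somewhere in the crowd to become more likely than not.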

The rival nobody remembers

Paul had a direct rival that summer: Mani, a parakeet at a fortune-teller's stall in Singapore, who made his calls by hopping out of his cage and choosing between two flag-marked cards. Mani got all four 2010 quarter-finals right, including the Netherlands knocking Brazil out, the upset that made his name. Then he split the semi-finals, and for the final he chose the Netherlands. Spain won. Mani finished the knockouts five correct out of seven, and almost nobody brings him up.

That is selection working on us rather than foresight working through an animal. Two creatures were hyped side by side. One went perfect and became a legend; the other missed the final and dropped out of the story. With enough animals having a go, somebody's pet is close to guaranteed to land a clean streak, and we keep the winner while forgetting the rest.

The pattern repeats every tournament. In 2018 a deaf cat named Achilles, one of the resident rat-catchers at the Hermitage Museum in St Petersburg, chose between two flag-marked bowls of food and opened with a strong run before tailing off into misses. Despite how the story usually gets retold, he never forecast the final, which was played in Moscow and well outside his beat. A camel, a sea turtle, elephants and a falcon have all taken their turn since. None has matched Paul's eight from eight, which is exactly what you would expect if that streak was luck all along.

An oracle can't tell you how sure it was

Here is the deeper limitation, and it has nothing to do with luck. When Paul opened a box, that was the entire output: one team, full stop. He could not tell you that he was almost certain about Germany against Argentina and barely past a coin-flip on Spain against the Netherlands. Every call carried the same weight, because the only thing on offer was a yes or a no.

A statistical model works the other way around. Its output is a number on every outcome — a chance the match is won, drawn, or lost, a chance each team lifts the trophy. Some of those numbers are confident and some sit close to even, and the model commits to the difference in advance.

That difference is the whole point, because it is what makes the model checkable. A single winner can only ever be marked right or wrong after the fact. A probability can be tested against reality in a way a yes/no call never can: gather every match where the model said roughly 70%, and ask whether the favoured side actually came through about 70% of the time. When the answer lines up, the model is calibrated. When it does not, you can see exactly where and by how much. There is no equivalent test you could ever have run on the octopus.
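As a rough sketch of that check, the snippet below groups forecasts into probability bands and compares the average stated probability in each band with how often the favoured side actually won. The function name, the ten-band layout and the toy data are all assumptions made for illustration; this is not the model's own calibration code.

```python
from collections import defaultdict


def reliability_table(forecasts, outcomes, n_bins=10):
    """Return (mean forecast, observed win rate, count) for each occupied band.

    forecasts: probabilities in [0, 1] that the favoured side wins.
    outcomes:  1 if the favoured side won, 0 otherwise.
    """
    bins = defaultdict(list)
    for p, won in zip(forecasts, outcomes):
        # Equal-width bands; a forecast of exactly 1.0 goes in the top band.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, won))

    table = []
    for idx in sorted(bins):
        rows = bins[idx]
        mean_p = sum(p for p, _ in rows) / len(rows)
        hit_rate = sum(won for _, won in rows) / len(rows)
        table.append((mean_p, hit_rate, len(rows)))
    return table


if __name__ == "__main__":
    # Invented toy data: ten calls at 70% and twenty calls at 55%.
    forecasts = [0.7] * 10 + [0.55] * 20
    outcomes = [1] * 7 + [0] * 3 + [1] * 11 + [0] * 9
    for mean_p, hit_rate, n in reliability_table(forecasts, outcomes):
        print(f"said {mean_p:.2f}, happened {hit_rate:.2f}, over {n} matches")
```

A calibrated forecaster shows the two columns close together in every band. An oracle that only ever names a winner would put every call in the top band, so there would be nothing to compare.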

What "beating the oracle" actually means

So suppose you wanted to beat Paul's 2010 run. On those specific eight matches, you could not, at least not honestly. A good model would have said some of those games were genuinely close, maybe 55 to 45, because they were. Calling a close match 55/45 is the correct answer, and the side you favoured still loses almost half the time. Eight-from-eight by a creature that only ever commits to one side, on a sample that short, is essentially unbeatable by anyone telling the truth about uncertainty.

That is the trap. Over eight matches, the forecaster who refuses to admit doubt and gets lucky will beat the one who is honest about it. Luck is loudest in small samples.

Stretch the test out, though, and it inverts completely. Over sixty-four matches, or six hundred, the coin-flipper's streak runs out and the honest probabilities start to pay. The forecaster whose 70%s really land 70% of the time pulls clear of anyone who was guessing, because being right and well-calibrated across a long run is something luck cannot fake. Beating an oracle is not about a hot streak. It is about being the one still standing when the sample gets long enough to tell luck and skill apart.

This is also why a model has to get matches "wrong" to be working correctly. If it says a side is 60% to win, the other result is supposed to happen 40% of the time. A model that never looked wrong would be a model that only ever said 100% — which is exactly what an animal oracle does, and exactly why it cannot be trusted past a good story.

So we put the model on the same board

We take that seriously enough to grade ourselves the way we would grade any oracle. Our prediction leaderboard ranks forecasters — people and the model alike — on a proper scoring rule, over real matches, with every call settled automatically once the result is in. No single lucky streak carries a row up the table. Accuracy across the full sample does.

If you want to see how the model has actually held up rather than how it talks about itself, the calibration record lays out its tournament Brier score and reliability across hundreds of past matches, and the methodology shows how the probabilities are built in the first place.
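For readers who have not met the Brier score, here is a minimal sketch of the two-outcome version: the squared gap between the stated probability and what happened, where lower is better. It assumes a simple win-or-not outcome and an invented 55/45 match; the leaderboard's exact scoring rule may differ, for instance by handling draws, and the function names are hypothetical.

```python
def brier(forecast: float, outcome: int) -> float:
    """Squared error between a stated probability and a 0/1 result (lower is better)."""
    return (forecast - outcome) ** 2


def expected_brier(forecast: float, true_p: float) -> float:
    """Long-run average Brier score when the event really happens with chance true_p."""
    return true_p * brier(forecast, 1) + (1 - true_p) * brier(forecast, 0)


if __name__ == "__main__":
    true_p = 0.55   # assumed real chance that the favourite wins
    honest = 0.55   # forecaster who states the real chance
    all_in = 1.0    # oracle-style caller who always claims certainty

    # One match that the favourite happens to win: certainty scores better.
    print("single lucky match:", brier(honest, 1), brier(all_in, 1))

    # Averaged over many such matches: honesty scores better.
    print("long-run average:  ",
          expected_brier(honest, true_p), expected_brier(all_in, true_p))
```

On the single lucky match the all-in caller scores a perfect 0 against roughly 0.20 for the honest forecaster. Averaged over many such matches the honest forecaster settles near 0.25 while the all-in caller drifts to about 0.45, which is the short-sample trap and the long-run reversal in numbers.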

Now it's your turn — and there's a badge in it

The board isn't only for the model. In Verified Calls you can call the result of upcoming matches yourself: each call is sealed the moment the match kicks off, stamped with a receipt, and graded automatically after full time. A few novelty animal oracles we track ride along on the very same board — their calls sealed before kickoff and scored exactly like yours.

So we made the obvious challenge a thing you can win. Out-call the leading animal oracle — across at least five graded calls of your own, so it is a record and not a coin-flip — and a Beat the animal oracles badge lands on your profile, yours to share.

Here is the honest catch, and it is the whole post in miniature. If an animal is mid-way through a flawless run, the bar sits at 100% and nobody can clear it without claiming a certainty they do not have. You cannot beat a perfect streak while it is still perfect. The moment it misses once — and over a long enough tournament it will — the bar drops, and the forecaster who has been honestly right starts to pull ahead. That is exactly when skill overtakes luck, turned into a game.

A psychic octopus is a wonderful tournament story. It is just not a forecast you could ever check. The only oracle worth beating is one that tells you how sure it is — and then lets you keep score.

The historical figures in this post (Paul's 2010 and 2008 records, Mani's 2010 knockout record, Achilles' 2018 run, and the one-in-256 streak probability) are matters of public record and simple arithmetic. The model's calibration figures are research outputs, described in full on the calibration record and methodology pages. They are for research and educational purposes only: not betting advice, not financial advice, not recommendations to gamble. The model can be wrong. See the full Terms of Use.