LLMs flunk season‑long football betting

Major large language models performed poorly when used to predict Premier League scores across a season, with ‘AI punters’ losing heavily in practice. The Financial Times found that general‑purpose LLMs struggled to match structured, domain‑specific football models and cautioned against treating LLMs as standalone forecasting tools. This is a reminder that structured features and explicit uncertainty often beat generic language models for sports forecasting. (ft.com)

A big language model can talk confidently about a football match, but the Financial Times found that confidence did not translate into profits when it was asked to predict Premier League scores across a full season and make betting picks. In the paper exercise, the “artificial intelligence punters” lost heavily, while more traditional football models held up better. (ft.com)

That result is less strange than it sounds, because betting is not a quiz where you just need the right answer more often than not. A bookmaker builds a margin into the odds, so a model has to be better than the price on offer, not just sound plausible in text. (pinnacleoddsdropper.com)

Football prediction models usually start with structured inputs like shots, goals, expected goals, home advantage, injuries, and recent form. The official Premier League statistics pages now publish club-level data including goals, passes, clean sheets, shots, and expected goals, which is the kind of table-based input those models are built around. (premierleague.com 1) (premierleague.com 2)

Expected goals is the key idea here, and it is simpler than it sounds. It gives every shot a probability of becoming a goal, so a close-range tap-in counts for more than a 30-yard blast, which helps a model separate luck from repeatable performance. (premierleague.com) (footystats.org)

A general-purpose language model is built for next-word prediction, not for estimating whether Arsenal have a 47 percent or 52 percent chance of winning away on a wet Sunday. It can summarize injury news and tactics, but that is different from producing calibrated probabilities that survive 380 matches and a bookmaker’s margin. (arxiv.org) (ar5iv.labs.arxiv.org)

That calibration point matters more than most people realize. Research on sports betting models has found that a well-calibrated forecast, where 60 percent events happen about 60 percent of the time, is more useful for profit than raw hit rate alone. (ar5iv.labs.arxiv.org) (researchgate.net)

The Financial Times test is really a reminder that language and forecasting are different jobs. One system is good at turning messy information into readable prose, while the other is good at squeezing signal out of columns of numbers and admitting uncertainty when the edge is tiny. (ft.com) (arxiv.org)

That does not make large language models useless in sport. It means they work better as assistants that collect team news, summarize injuries, or package model outputs into plain English than as standalone tipsters picking exact scores from a prompt. (ft.com) (github.com)

The broader lesson reaches beyond football. When a task depends on explicit probabilities, stable inputs, and knowing when you are unsure, a smaller model built for that narrow job can beat a much larger model built to sound generally intelligent. (arxiv.org) (developers.openai.com)
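
The points about bookmaker margin and calibration can be made concrete with a little arithmetic. The sketch below is illustrative only: the odds are made up, the function names are hypothetical, and the proportional method used to strip out the margin is one common simplification, not the approach used in the Financial Times exercise or by any particular bookmaker.

```python
# Illustrative sketch only: the odds and names below are assumptions.

def implied_probabilities(decimal_odds):
    """Convert decimal odds to raw implied probabilities.

    Because of the bookmaker's margin, these sum to more than 1.
    """
    return [1.0 / o for o in decimal_odds]


def overround(decimal_odds):
    """Bookmaker margin: how far the raw implied probabilities exceed 1."""
    return sum(implied_probabilities(decimal_odds)) - 1.0


def fair_probabilities(decimal_odds):
    """Rescale implied probabilities so they sum to 1.

    Proportional rescaling is a simplifying assumption; other
    margin-removal methods exist.
    """
    raw = implied_probabilities(decimal_odds)
    total = sum(raw)
    return [p / total for p in raw]


def expected_value(model_prob, decimal_odds):
    """Expected profit per unit staked if model_prob is the true probability."""
    return model_prob * decimal_odds - 1.0


def brier_score(forecasts, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes.

    Lower is better; it rewards probabilities that match how often
    events actually happen, not just picking the right side.
    """
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)


if __name__ == "__main__":
    # Hypothetical home / draw / away prices for a single match.
    odds = [2.10, 3.40, 3.60]
    print("margin:", round(overround(odds), 4))
    print("fair probabilities:", [round(p, 4) for p in fair_probabilities(odds)])
    # The same bet at 2.10 loses money on average at 47 percent
    # and makes money on average at 52 percent.
    print("EV at 47%:", round(expected_value(0.47, 2.10), 4))
    print("EV at 52%:", round(expected_value(0.52, 2.10), 4))
```

At a price of 2.10 the break-even probability is about 47.6 percent, so the gap between a 47 percent and a 52 percent estimate is the gap between a losing bet and a winning one. That is why small calibration errors matter more than how convincing a prediction sounds.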
