Grok flunks sports‑bet test

Research reported that xAI’s Grok performed worst among mainstream chatbots in a sports‑betting simulation, losing its entire notional bankroll over a soccer season (pcmag.com). The finding was presented as a narrow benchmark rather than a broad verdict on the model’s capabilities across other tasks (pcmag.com).

A new benchmark that made chatbots bet on English Premier League matches found Grok finished last and burned through its entire notional bankroll. (gr.inc) General Reasoning released the test, called KellyBench, on April 9, 2026. It put eight language models into a simulated 2023–24 Premier League betting market with a normalized starting bankroll of £100,000 and asked them to grow it over a full season. (gr.inc)

The setup was not a simple pick-the-winner quiz. The models got historical team data, lineups, past results, advanced statistics, and public odds, then had to build prediction systems, decide when they had an edge, size bets, and manage risk over time. (gr.inc)

Grok 4.20 posted a mean return on investment of negative 100.0% across three runs and an average final bankroll of £0. General Reasoning said Grok lost all its money in one run and did not finish the other two, which still counted as total losses in the benchmark. (gr.inc) (pcmag.com)

The result did not mean the other chatbots beat the market. General Reasoning said every frontier model it tested lost money over the season, and many hit ruin, with Claude Opus 4.6 doing best at negative 11.0% on average and OpenAI’s GPT-5.4 next at negative 13.6%. (gr.inc)

That makes the test more about long-run decision-making than about soccer fandom. The paper says betting markets force a model to turn predictions into actions under uncertainty, then adjust when the environment changes, which is different from solving a fixed benchmark with one right answer. (gr.inc)

General Reasoning also scored how the models behaved, not just whether they won. It used a 44-point rubric built with quantitative betting fund experts and said Grok’s mean “sophistication” score was 9.8%, far below Claude Opus 4.6 at 32.6% and GPT-5.4 at 31.8%. (gr.inc)

The company framed KellyBench as a narrow evaluation, not a verdict on every use of a chatbot. Ross Taylor, General Reasoning’s chief executive, told PCMag that much of current artificial intelligence testing still happens in “very static environments” rather than long-horizon settings closer to real operations. (pcmag.com)

The paper also says these runs were expensive and lengthy. The evaluated agents used roughly 500 to 900 tool calls per episode and 32 million to 450 million tokens, and one GPT-5.4 run cost $2,012 to complete. (gr.inc)

So the cleanest takeaway is narrower than the headline. In this one replay of the 2023–24 Premier League season, none of the tested models beat the market, and Grok was the only mainstream chatbot to finish with an average bankroll of zero. (gr.inc)
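
For readers who want to check the arithmetic behind the headline figure, here is a minimal Python sketch of how a mean return on investment of negative 100.0% follows from the reported outcome. It assumes the £100,000 starting bankroll and the rule that unfinished runs count as total losses, both as described above. The function name `run_roi` and the tuple layout are hypothetical and are not taken from the KellyBench paper.

```python
from statistics import mean

START_BANKROLL = 100_000.0  # normalized starting bankroll in GBP, as reported


def run_roi(final_bankroll, finished=True):
    """Return one run's ROI as a fraction of the starting bankroll.

    Assumption based on the reporting: a run that does not finish
    is scored as a total loss, i.e. a final bankroll of zero.
    """
    if not finished:
        final_bankroll = 0.0
    return (final_bankroll - START_BANKROLL) / START_BANKROLL


# Grok's three runs as described: one ended at zero, two did not finish.
grok_runs = [(0.0, True), (None, False), (None, False)]
grok_mean_roi = mean(run_roi(bankroll, done) for bankroll, done in grok_runs)

print(f"Grok mean ROI: {grok_mean_roi:.1%}")
```

By the same formula, a single run that ended the season with £89,000 would have an ROI of negative 11%. That figure is purely illustrative, since the per-run bankrolls of the other models are not given here.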
