DeepMind model matches superforecasters

- Google DeepMind said on May 24 its forecasting model matched expert superforecasters on multiple probabilistic benchmarks, publishing results, methods and notebooks online.
- ForecastBench’s public site shows superforecasters at the top, with model comparisons reported using Brier scores and confidence intervals on nightly updated leaderboards.
- ForecastBench says datasets, leaderboards and resolution values are updated nightly through its website and linked GitHub repositories.

Google DeepMind said on May 24 that one of its models matched expert superforecasters on multiple probabilistic forecasting benchmarks, according to a post shared on X and linked evaluation materials. The claim centers on Brier scores, a standard measure for judging probabilistic forecasts in which lower scores are better.

DeepMind said it published the underlying methodology, confidence intervals and notebooks alongside the results. The materials point readers to public benchmark infrastructure rather than a closed internal test.

Forecasting benchmarks have become a closely watched test for large language models because they ask systems to estimate the probability of real-world events before outcomes are known. ForecastBench, one of the main public references in this area, describes itself as a “dynamic, contamination-free benchmark” of AI forecasting accuracy with human comparison groups. Its paper says the benchmark was built around regularly updated questions about future events, specifically to avoid leakage from known answers. (forecastbench.org)

### What exactly is being compared here?

ForecastBench compares model forecasts with human baselines including a public median forecast and a superforecaster median forecast. On its public site, the superforecaster median sits at the top of both the tournament and baseline views shown in current leaderboards. The benchmark’s published tables report performance using overall scores, Brier-based comparisons and 95% confidence intervals.

The benchmark paper by Ezra Karger, Houtan Bastani, Chen Yueh-Han, Zachary Jacobs, Danny Halawi, Fred Zhang and Philip Tetlock says expert human forecasters outperformed the top-performing LLM in the paper’s original evaluation sample. (forecastbench.org)

That makes DeepMind’s newer claim notable because it is framed as parity with a human reference group that earlier public results still showed ahead.

### What is a Brier score, and why does it matter?

Brier scores measure the accuracy of probabilistic predictions by comparing forecast probabilities with eventual outcomes. (forecastbench.org) In practice, a lower Brier score means the forecast was better calibrated and more accurate. ForecastBench’s public human leaderboard shows a superforecaster median overall score of 0.093, with a 95% confidence interval of 0.073 to 0.112, while several model entries cluster above that level in the archived table. (arxiv.org)

ForecastBench’s front-end also presents a “Brier Index (%)” chart, which converts raw benchmark performance into a more legible index for trend tracking. On that chart, the site projects LLM-superforecaster parity in May 2027, with a 95% confidence interval ranging from April 2026 to August 2028. DeepMind’s claim, if reproduced on the same framework, would indicate a model reaching that threshold earlier than the site’s central trendline. That is an inference from the benchmark’s projection and DeepMind’s stated result. (forecastbench.org)

### How public is the benchmark?

ForecastBench says its datasets repository is updated nightly and that question sets are released every two weeks. The associated GitHub repository says the benchmark data are distributed under a CC BY-SA 4.0 license and links the codebase used for the benchmark. The site also says human comparison groups shown as reference lines were last surveyed in July 2024 and answered a different set of questions than models evaluated later. (forecastbench.org)

To compare non-overlapping question sets, ForecastBench says it uses a two-way fixed-effects model that adjusts for question difficulty. That detail matters because any claim of parity depends on the adjustment method as well as the raw forecasts. (forecastbench.org)

### Does this mean AI has clearly surpassed human forecasters?

ForecastBench’s current public pages do not show a DeepMind-branded entry in the top results surfaced by search, and the benchmark’s homepage still lists the superforecaster median first. Good Judgment, which works with superforecasters, also says superforecasters still lead in the benchmark results it cites from October 2025.

What DeepMind appears to be claiming is narrower: that its model matched expert superforecasters on multiple benchmark sets under a published evaluation setup. (forecastbench.org)

The next place to check is the benchmark website and linked GitHub materials, which ForecastBench says are updated nightly, as well as any full DeepMind paper or repository that follows the May 24 X post. (forecastbench.org)
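
For readers who want to check the arithmetic behind the Brier scores discussed above, the minimal sketch below computes the score for binary questions as the mean squared difference between each forecast probability and the 0-or-1 outcome. The function name and the sample forecasts are illustrative assumptions, not ForecastBench or DeepMind code or data, and the benchmark's difficulty adjustment is not reproduced here.

```python
def brier_score(forecasts, outcomes):
    """Mean squared difference between forecast probabilities and binary outcomes."""
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("need equal-length, non-empty inputs")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


# Hypothetical data: probability that each event happens, and how it resolved
# (1 = happened, 0 = did not happen). These numbers are made up for illustration.
confident = [0.9, 0.1, 0.8, 0.2]
hedged = [0.5, 0.5, 0.5, 0.5]
resolved = [1, 0, 1, 0]

print(brier_score(confident, resolved))
print(brier_score(hedged, resolved))
```

With these made-up numbers, the confident and mostly correct forecaster scores about 0.025, while answering 50% on every question scores 0.25. That second figure is a useful yardstick on binary questions: an uninformative coin-flip forecast always lands at 0.25, so lower values reflect forecasts that sat closer to what actually happened.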
