Reasoning failures still haunt models
Recent analysis argues that AI systems can score well on reasoning benchmarks yet still fail at real tasks because they tend to guess rather than ask for missing information. Researchers and commentators say that over-optimising for benchmark performance can hide a model’s tendency to hallucinate when uncertain. (medium.com, the-decoder.com)
Large language models still ace many logic and math tests, but recent research finds they often fail when a task is missing one crucial fact. (openreview.net)
A reasoning benchmark usually gives the model every piece it needs, then grades the final answer as right or wrong. QuestBench, released in 2025 by researchers at Massachusetts Institute of Technology and Google DeepMind, instead removes one necessary variable and asks whether the model can identify the single clarifying question it needs to solve the task. (openreview.net)
On that benchmark, current models did well on the grade-school-math sets but scored only 40% to 50% accuracy on the logic and planning sets. The authors wrote that models often failed to ask the right question even when they could solve the same problem once all the missing information was supplied. (openreview.net)
That gap has pushed researchers to separate “reasoning” from “information gathering.” In a 2024 Stanford-led paper, researchers trained a model to ask better clarifying questions and reported that, after two rounds of self-improvement on 25,500 synthetic prompts, its responses were preferred over the initial model’s on 72% of tasks. (cicl.stanford.edu)
The same pattern shows up in medicine, where missing details can change a diagnosis. A NeurIPS 2024 paper from researchers at the University of Washington, Carnegie Mellon University, Cornell Tech, and the Allen Institute for Artificial Intelligence said standard single-turn medical benchmarks diverge from real clinical conversations, and that adding question-asking with abstention strategies improved diagnostic accuracy by 22.3% on their MEDIQ setup. (proceedings.neurips.cc)
The benchmark problem is older than this latest wave of papers. Stanford’s Holistic Evaluation of Language Models project, first published in 2022, argued that language models should be judged on multiple metrics, including calibration and robustness, instead of accuracy alone. (arxiv.org)
That warning has since been echoed by model builders themselves. OpenAI wrote in a September 5, 2025, research note that standard training and evaluation procedures “reward guessing over acknowledging uncertainty,” and said scoreboards that rank systems mainly by accuracy can make a guessing model look better than a careful one that abstains. (openai.com)
OpenAI gave a concrete example from its SimpleQA evaluation: GPT-5-thinking-mini had a 52% abstention rate and 22% accuracy rate, while OpenAI o4-mini had a 1% abstention rate and 24% accuracy rate. The company said errors are worse than abstentions, even when the lower-abstention model looks slightly better on raw accuracy. (openai.com)
Researchers are now building tests around that tradeoff instead of treating it as noise. The newer benchmarks ask whether a model can stop, admit uncertainty, and request the missing detail before it commits to an answer that sounds complete but is wrong. (openreview.net, proceedings.neurips.cc)
The thread running through these papers is simple: a model that guesses fast can still top a leaderboard, while a model that asks one good question may do better at the job people actually hand it. (openreview.net, openai.com)
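The SimpleQA figures quoted above can be worked through in a few lines. The sketch below is illustrative only: it assumes every question ends in exactly one of three outcomes (correct, abstained, or wrong), so the error rate is whatever remains after accuracy and abstention. The `EvalResult` class and the `wrong_penalty` weight are hypothetical names chosen for this example, and the penalized score is not OpenAI's scoring rule, just one simple way to make wrong answers cost more than abstentions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    """One model's results, in whole percentage points."""
    name: str
    accuracy_pct: int
    abstention_pct: int

    @property
    def error_pct(self) -> int:
        # Assumption: each question is correct, abstained, or wrong,
        # so the error rate is the remainder.
        return 100 - self.accuracy_pct - self.abstention_pct

    def penalized_score(self, wrong_penalty: float = 1.0) -> float:
        # Hypothetical scoring: +1 per correct answer, 0 per abstention,
        # -wrong_penalty per wrong answer.
        return self.accuracy_pct - wrong_penalty * self.error_pct


# Accuracy and abstention rates as quoted in the article.
results = [
    EvalResult("gpt-5-thinking-mini", accuracy_pct=22, abstention_pct=52),
    EvalResult("o4-mini", accuracy_pct=24, abstention_pct=1),
]

for r in results:
    print(f"{r.name}: accuracy={r.accuracy_pct}% "
          f"errors={r.error_pct}% penalized={r.penalized_score():+.0f}")
```

Under that assumption, the implied error rates are 26% for the model that abstains often and 75% for the one that almost never does. An accuracy-only ranking puts the second model ahead by two points, while any scoring that charges for wrong answers reverses the order.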