Study finds 26.5–54% tool mismatch
- University of Maryland-affiliated researchers posted a May 20 explainer of their preprint showing large language models often misjudge when external tools are needed.
- The paper reported mismatch rates of 26.5% to 54.0% in arithmetic and 30.8% to 41.8% in factual question answering.
- The preprint is on arXiv, and code plus datasets are available in the authors’ GitHub repository.
Yize Cheng and four co-authors reported in a new study that large language models often fail to align tool use with actual need, even when their internal representations appear to recognize that a tool is required. The paper, “Model-Adaptive Tool Necessity Reveals the Knowing-Doing Gap in LLM Tool Use,” was submitted to arXiv on May 13 and was being circulated by the authors on May 20. The researchers tested four open-weight models on arithmetic and factual question-answering tasks and measured how often tool calls matched what each model empirically needed to solve a prompt. They reported mismatch rates of 26.5% to 54.0% on arithmetic and 30.8% to 41.8% on factual QA.

### Why did the authors say “tool necessity” should depend on the model?

The paper argues that prior work often treated tool necessity as a fixed property of a query, using human labels or a stronger model’s judgment. The authors said that approach misses a practical point: a stronger model may solve a task unaided, while a weaker one may need a calculator, search engine or other tool for the same prompt. (arxiv.org)

The GitHub repository defines a tool as necessary for a specific model when that model “cannot reliably solve a query consistently without it.” The paper says its necessity labels are grounded in each model’s empirical performance rather than a model-agnostic annotation scheme. (arxiv.org)

### Which models and tasks were tested?

The repository says the experiments covered Llama-3.1-8B-Instruct, Llama-3.2-3B-Instruct, Qwen3-8B and Qwen3-4B. The two reported domains were arithmetic and factual QA, with the repository naming TruthfulQA for the factual benchmark. The arXiv abstract says the comparison was made across four models on arithmetic and factual QA datasets. (github.com)

The reported mismatch rates varied by model and task rather than clustering around a single number.

### What exactly was mismatched?
The central comparison in the paper is between necessity and behavior: whether a tool was empirically needed for a given model, and whether the model actually called one. (github.com) The authors said those two did not line up in a substantial share of cases.

The repository organizes outcomes into four buckets: necessary_called, necessary_Notcalled, unnecessary_called and unnecessary_Notcalled. (arxiv.org) That setup captures both underuse, where a model should have called a tool but did not, and overuse, where it called a tool when one was not needed. (arxiv.org)

### Where does the paper say the failure happens inside the model?

The authors split tool use into two stages: a cognition stage, reflecting whether a model internally treats a tool as necessary, and an execution stage, reflecting whether it actually issues the tool call. The arXiv abstract says both signals were often linearly decodable from hidden states. (github.com)

The same abstract says the probe directions for cognition and action became “nearly orthogonal” in late layers at the last-token stage that drives next-token output. The authors said tracing sample trajectories showed most mismatch was concentrated in the transition from cognition to action, not in cognition itself. (arxiv.org)

### Why does that matter for agent design?

The paper frames the result as a “knowing-doing gap” in tool use. In the authors’ account, the issue is not only whether a model can represent that it needs help, but whether that representation reliably turns into a tool call. That distinction matters for systems built around search, calculators and APIs, because a benchmark that checks only final answers may miss failures in the decision layer that chooses whether to use a tool at all. (arxiv.org)

The authors said improving reliability will require better translation of internal recognition into action.

### Where can readers inspect the paper and data?
(arxiv.org) ArXiv lists the paper as arXiv:2605.14038 and shows Yize Cheng, Chenrui Fan, Mahdi JafariRaviz, Keivan Rezaei and Soheil Feizi as authors. The submission date shown on the abstract page is May 13, 2026.

The GitHub repository, titled Tool-Cognition-Action, says it contains the code and data used in the paper, including raw data split by model, domain and outcome category. (arxiv.org) The repository also says model generations with and without tool calls are saved alongside the necessity labels. (github.com)
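
The four outcome categories named in the repository lend themselves to a simple tally. The following is a minimal Python sketch, not the authors' code: the function names, the toy samples, and the choice to define the mismatch rate as underuse plus overuse over all samples are assumptions made for illustration, and the numbers are invented rather than taken from the paper.

```python
from collections import Counter

# The four outcome category names as listed in the repository.
CATEGORIES = (
    "necessary_called",
    "necessary_Notcalled",
    "unnecessary_called",
    "unnecessary_Notcalled",
)


def categorize(tool_needed: bool, tool_called: bool) -> str:
    """Map one (need, behavior) pair to a category name (hypothetical helper)."""
    need = "necessary" if tool_needed else "unnecessary"
    action = "called" if tool_called else "Notcalled"
    return f"{need}_{action}"


def mismatch_summary(samples):
    """Tally categories and rates from (tool_needed, tool_called) pairs.

    Assumption: mismatch = underuse + overuse, divided by all samples.
    """
    counts = Counter(categorize(needed, called) for needed, called in samples)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no samples to summarize")
    underuse = counts["necessary_Notcalled"]  # tool needed, not called
    overuse = counts["unnecessary_called"]    # tool called, not needed
    return {
        "counts": {name: counts[name] for name in CATEGORIES},
        "underuse_rate": underuse / total,
        "overuse_rate": overuse / total,
        "mismatch_rate": (underuse + overuse) / total,
    }


# Invented toy data: 10 samples, 3 underuse cases and 1 overuse case.
toy_samples = (
    [(True, True)] * 4
    + [(True, False)] * 3
    + [(False, True)] * 1
    + [(False, False)] * 2
)
summary = mismatch_summary(toy_samples)
print(summary)
```

With real data, the boolean pairs would come from the repository's necessity labels and recorded tool calls; the file layout there is not assumed here.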