Apple paper: models ignore step guidance
- Apple researchers said in a June 2025 paper that frontier reasoning models failed to reliably use explicit algorithms as task complexity increased.
- The paper found three regimes under equal inference compute: standard models won on low-complexity tasks, reasoning models helped in medium ones, and both collapsed on hard ones.
- The paper, “The Illusion of Thinking,” is on Apple Machine Learning Research and arXiv, with Apple authors including Parshin Shojaee and Samy Bengio.

Apple researchers used controllable puzzle environments to test whether large reasoning models actually follow multi-step procedures or mainly reproduce familiar patterns. In a June 2025 paper, the team reported that frontier reasoning models improved on some medium-complexity tasks but suffered what it called a “complete accuracy collapse” once complexity passed a threshold. Apple said the models also have “limitations in exact computation” and “fail to use explicit algorithms,” even when the task structure stayed logically consistent. Those results came from experiments designed to inspect not only final answers but also the intermediate reasoning traces the models produced.

### What did Apple test that standard benchmarks often miss?

Apple’s paper said common reasoning benchmarks often emphasize final-answer accuracy and can be distorted by data contamination, making it hard to tell whether a model is reasoning or recalling patterns. To get around that, the researchers built controllable puzzle environments where they could vary compositional complexity while keeping the underlying logic stable. That let them compare outputs across problem families instead of relying on one-off benchmark questions. (machinelearning.apple.com)

The authors (Parshin Shojaee, Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton, Samy Bengio and Mehrdad Farajtabar) wrote that this setup allowed them to verify both final answers and intermediate steps. Apple said that made it possible to inspect how the models’ reasoning traces changed as tasks became harder.

### Where did the models help, and where did they fail?

Apple reported three performance regimes when it compared large reasoning models with standard large language models under equivalent inference compute. (machinelearning.apple.com)

In low-complexity tasks, standard models “surprisingly” outperformed reasoning models, the paper said.
In medium-complexity tasks, the extra thinking steps used by reasoning models produced an advantage. In high-complexity tasks, both classes of models failed.

The paper also said reasoning effort did not scale smoothly with harder problems. Apple wrote that the models’ reasoning effort increased with complexity “up to a point,” then declined even when token budget remained available. That pattern, the researchers said, appeared across diverse puzzles in their test setup.

### Why does the “explicit algorithms” line matter?

Apple wrote that the models had “limitations in exact computation” and “fail to use explicit algorithms and reason inconsistently across scales and problems.” That is the line that has drawn attention because it cuts against the idea that giving a model a clearer step sequence will necessarily make it behave like a dependable procedural system. (machinelearning.apple.com)

The paper does not say models never benefit from structured reasoning; it says the benefit breaks down as complexity rises.

Apple’s earlier October 2024 GSM-Symbolic paper pointed in a similar direction. In that work, Apple researchers said model performance dropped when only numerical values changed and deteriorated sharply as extra clauses were added, including declines of up to 65% from a single added clause. The authors said those results suggested models were often replicating reasoning patterns from training data rather than performing stable logical reasoning. (machinelearning.apple.com)

### So what should readers take from this paper?

Apple’s paper is narrower than the online reaction to it. The researchers did not claim all reasoning traces are useless, and they did not argue that reasoning models never outperform standard models. They reported a more specific result: chain-of-thought-style reasoning helps in a middle band of difficulty, but it does not reliably turn current models into algorithm-following systems on harder tasks.
(machinelearning.apple.com)

The next reference point is the paper itself. Apple has posted “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity” on its Machine Learning Research site, and the same work is available on arXiv with the Apple author list and June 2025 date. (machinelearning.apple.com)
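
### What does a controllable puzzle look like in practice?

The article above describes puzzles whose difficulty can be raised while the rules stay fixed, and whose intermediate steps can be checked one by one. The sketch below illustrates that idea with Tower of Hanoi, one of the puzzles the paper used. It is not Apple's evaluation code: the function names `solve_hanoi` and `check_trace`, the peg numbering and the move format are all assumptions made for illustration.

```python
from typing import List, Tuple

# A move is (source peg, destination peg); pegs are numbered 0, 1, 2.
# This encoding is an assumption for the sketch, not the paper's format.
Move = Tuple[int, int]


def solve_hanoi(n: int, src: int = 0, aux: int = 1, dst: int = 2) -> List[Move]:
    """Textbook recursive algorithm: the kind of explicit procedure
    a model could in principle follow step by step."""
    if n == 0:
        return []
    return (
        solve_hanoi(n - 1, src, dst, aux)    # park the n-1 smaller disks on aux
        + [(src, dst)]                       # move the largest disk
        + solve_hanoi(n - 1, aux, src, dst)  # bring the smaller disks over
    )


def check_trace(n: int, moves: List[Move]) -> Tuple[bool, int]:
    """Replay a proposed move list on an n-disk puzzle.

    Returns (solved, first_bad), where first_bad is the index of the
    first illegal move, or -1 if every move was legal.
    """
    # Peg 0 starts with disks n..1, largest at the bottom of the stack.
    pegs = [list(range(n, 0, -1)), [], []]
    for i, (src, dst) in enumerate(moves):
        legal = bool(
            src in (0, 1, 2)
            and dst in (0, 1, 2)
            and src != dst
            and pegs[src]  # there must be a disk to move
            and (not pegs[dst] or pegs[dst][-1] > pegs[src][-1])
        )
        if not legal:
            return False, i
        pegs[dst].append(pegs[src].pop())
    # Solved only if every disk ended up on the last peg.
    return len(pegs[2]) == n, -1


if __name__ == "__main__":
    # n is the complexity knob: same rules, longer required solution.
    for n in range(1, 6):
        trace = solve_hanoi(n)
        print(n, len(trace), check_trace(n, trace))
```

Raising `n` changes nothing about the rules, but the shortest solution grows to 2^n - 1 moves, which is what makes complexity controllable. Because `check_trace` replays every move, a checker like this can report where a proposed solution first goes wrong instead of only grading the final answer.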