Apple paper: models don't think
- Apple researchers put out “The Illusion of Thinking” in June 2025, arguing today’s reasoning models break on controlled puzzles as complexity rises.
- The paper’s sharpest claim is a three-regime pattern: plain LLMs can win on easy tasks, reasoning models help in mid-range ones, then both collapse.
- It matters because the fight is no longer “can models reason?” but “what exactly are benchmarks measuring?” after fast public rebuttals.

Apple’s paper landed because it hit a nerve the AI world already had. Reasoning models were being sold as a step beyond ordinary chatbots: models that do better by “thinking” longer before answering. Then Apple researchers showed a bunch of those systems falling apart on carefully designed puzzles as the puzzles got harder. The claim was not just that models make mistakes. It was that the visible chain-of-thought may not mean what people think it means. (machinelearning.apple.com)

### What did Apple actually test?

Not the usual math olympiad or coding leaderboards. Apple built controllable puzzle environments, things like Towers of Hanoi and river-crossing tasks, where the researchers could increase complexity while keeping the underlying logic the same. That matters because standard benchmarks are messy. They can be contaminated by training data […] here. (machinelearning.apple.com)

### What was the headline result?

Apple says frontier “large reasoning models” show a complete accuracy collapse past certain complexity thresholds. Even more interesting, the models’ reasoning effort rises with difficulty only up to a point. Then it drops off, despite still having token budget left. So the weird part is not just failure. It is failure after trying harder for a while, then seemingly giving up. (machinelearning.apple.com)

### Why did that sound so explosive?

Because it cut against the simple scaling story. The popular pitch was: give models more room to think, and they will keep improving on hard problems. Apple’s paper says the picture is more jagged. On easy tasks, standard LLMs can even outperform reasoning models. On medium tasks, the extra deliberation helps. On hard ones, both collapse. […] it is still a real challenge to the hype. (machinelearning.apple.com)

### So did Apple prove models don’t think?

No, and this is where a lot of the online discourse got sloppy.
The paper raises questions about what current reasoning traces really show. It does not settle the philosophy of mind. Basically, Apple tested whether these systems reliably execute algorithmic, compositional reasoning under controlled difficulty increases […] “no reasoning of any kind exists.” (machinelearning.apple.com)

### Why did researchers push back so fast?

Because some critics thought the benchmark design itself created fake failures. A response paper from Anthropic and Open Philanthropy argued that the Towers of Hanoi setup ran into output-length limits, and that some river-crossing cases were literally unsolvable under the stated constraints. In that reading, the models […] tasks. (arxiv.org)

### Did later work rescue the models?

Partly, but not cleanly. A follow-up paper, “Rethinking the Illusion of Thinking,” says some of Apple’s strongest failure cases weaken once the tasks are reformulated more carefully. The authors report that solvable river-crossing instances can be handled at much larger scales, while Towers of Hanoi still gets shaky at moderate complexity, around 8 disks in their […] wrong.” It was more like: the benchmark mixed real limits with avoidable artifacts. (arxiv.org)

### What is the real takeaway?

The real fight is over measurement. If a model fails because it cannot plan, that means one thing. If it fails because the test demands exhaustive text output, that means another. And if both are happening at once, benchmark scores blur them together. Apple’s paper mattered because it forced that distinction into the open. (machinelearning.apple.com)

### What is the bottom line?

The paper did not prove that models are mindless autocomplete. But it did puncture the easy story that longer chains of thought equal robust reasoning.
The more durable lesson is harsher and more useful: current models may be clever searchers with brittle reasoning, and the field still does not have a clean way to tell the two apart. (arxiv.org)
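
The output-length objection is easy to make concrete. The shortest Towers of Hanoi solution for `n` disks takes `2**n - 1` moves, so a test that asks for the full move list grows exponentially with the disk count. The sketch below is illustrative Python, not code from Apple's paper or from either response; the function names (`hanoi_moves`, `is_valid_solution`, `min_moves`) and the peg numbering are assumptions made for this example.

```python
from typing import Iterator, List, Tuple

Move = Tuple[int, int]  # (from_peg, to_peg); pegs are numbered 0, 1, 2 (an assumption)


def min_moves(n: int) -> int:
    """Length of the shortest solution for n disks."""
    return 2 ** n - 1


def hanoi_moves(n: int, src: int = 0, dst: int = 2, aux: int = 1) -> Iterator[Move]:
    """Yield the optimal move sequence using the standard recursion."""
    if n == 0:
        return
    yield from hanoi_moves(n - 1, src, aux, dst)  # clear the n-1 smaller disks
    yield (src, dst)                              # move the largest disk
    yield from hanoi_moves(n - 1, aux, dst, src)  # restack the smaller disks


def is_valid_solution(n: int, moves: List[Move]) -> bool:
    """Replay a move list; every step must be legal and end with all disks on peg 2."""
    pegs = [list(range(n, 0, -1)), [], []]  # largest disk at the bottom of peg 0
    for src, dst in moves:
        if src not in (0, 1, 2) or dst not in (0, 1, 2) or not pegs[src]:
            return False
        if pegs[dst] and pegs[dst][-1] < pegs[src][-1]:
            return False  # larger disk placed on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs[2] == list(range(n, 0, -1))


if __name__ == "__main__":
    for n in (3, 8, 10, 15):
        print(n, min_moves(n))  # 7, 255, 1023, 32767 moves respectively
```

A checker like `is_valid_solution` only tells you whether the written-out move list is correct. It cannot say whether a wrong answer came from a planning failure or from the model running out of room to write the moves down, which is exactly the ambiguity the critics pointed at.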