LLMs fail robot planning
- A Stanford/ETH Zurich paper found large language models often produce unsafe plans for robot tasks, even when the plans are otherwise valid.
- The study says these failures affect companies building humanoid robots, explicitly naming Figure and Tesla Optimus.
- Authors warn that the gap between completing a task and recognizing its hazards creates safety risks for robot deployment.
Robots use large language models as a kind of text-based foreman: the model turns a request like “put the bleach away” into a sequence of actions. A new paper from ETH Zurich and Stanford says that planner often produces unsafe steps even when the overall plan looks correct. (arxiv.org)

The paper, posted April 20, 2026, introduces a benchmark called DESPITE with 12,279 robot-planning tasks that test both physical hazards and social-rule violations. Across 23 models, the authors found the best planning model failed to make a valid plan on just 0.4% of tasks but still generated dangerous plans on 28.3%. (arxiv.org)

Among 18 open-source models ranging from 3 billion to 671 billion parameters, planning scores rose sharply with scale, from 0.4% to 99.3%. Safety awareness stayed much flatter, between 38% and 57%, which the authors say means bigger models get more tasks done mostly because they plan better, not because they avoid danger better. (arxiv.org)

That gap matters because many robot systems split work the same way: one model reasons in language, another sees the scene, and lower-level software moves the arms and hands. If the planner picks the wrong sequence, the rest of the stack can execute a bad idea faithfully. (nature.com, arxiv.org)

The paper lands as humanoid companies are pushing from demos toward deployment. Tesla says Optimus is meant to be a “general purpose, bi-pedal, autonomous humanoid robot,” and its April 22, 2026 quarterly filing said the first-generation line is designed for 1 million robots a year. (tesla.com, ir.tesla.com)

Figure is making a similar pitch around home and workplace tasks. Its site says Figure 03 is a general-purpose humanoid for everyday use, and its January 2026 Helix 02 update said one neural system can control the full body for walking, balancing, and manipulation. (figure.ai, figure.ai)

The authors explicitly cite Figure and Tesla Optimus as examples of companies using large-model reasoning in humanoid systems. Their warning is narrow but direct: better task completion rates do not guarantee that a robot understands when a step is dangerous. (arxiv.org)

The paper also found a split between model types. Three proprietary reasoning models reached safety-awareness scores of 71% to 81%, while non-reasoning proprietary models and open-source reasoning models stayed below 57%. (arxiv.org)

Researchers in robotics have been trying to combine language models with more rigid planning and control systems for exactly this reason. A 2025 Nature Machine Intelligence review said long-horizon robot planning works best when large language models are paired with classical control methods instead of trusted on their own. (nature.com)

The new result does not say humanoid robots cannot work; it says the language model in the loop can miss hazards in systematic ways. As companies scale from pilot tasks to broader deployment, the paper argues that safety checks have to improve faster than the robots’ ability to finish the job. (arxiv.org)
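
To make the distinction between a valid plan and a safe plan concrete, here is a minimal Python sketch of a rule-based check sitting between a language-model planner and the rest of the robot stack. It is an illustration only, not the DESPITE benchmark's method or any company's software: the `Step` type, the `HAZARD_RULES` table, the location names, and both functions are hypothetical, and a real system would need far richer hazard models than a lookup table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One planner step. All field names here are hypothetical."""
    action: str       # "pick" or "place"
    obj: str          # object being manipulated
    target: str = ""  # destination for "place" steps


# Hypothetical hazard rules: (object, forbidden target, reason).
HAZARD_RULES = [
    ("bleach", "food_shelf", "chemical stored with food"),
    ("bleach", "ammonia_cabinet", "bleach and ammonia can release toxic gas"),
    ("knife", "child_table", "sharp object within a child's reach"),
]


def is_valid(plan, goal_obj):
    """Feasibility only: objects are picked before being placed,
    and the goal object ends up placed somewhere."""
    held, placed = set(), set()
    for step in plan:
        if step.action == "pick":
            held.add(step.obj)
            placed.discard(step.obj)
        elif step.action == "place":
            if step.obj not in held:
                return False  # cannot place something not being held
            held.remove(step.obj)
            placed.add(step.obj)
        else:
            return False  # unknown action
    return goal_obj in placed


def find_hazards(plan):
    """Return (step index, reason) for every step that matches a hazard rule."""
    hazards = []
    for i, step in enumerate(plan):
        if step.action != "place":
            continue
        for obj, target, reason in HAZARD_RULES:
            if step.obj == obj and step.target == target:
                hazards.append((i, reason))
    return hazards


# A plan that completes "put the bleach away" but does so unsafely.
plan = [Step("pick", "bleach"), Step("place", "bleach", "food_shelf")]
# An alternative that completes the same task without tripping a rule.
safe_plan = [Step("pick", "bleach"), Step("place", "bleach", "utility_closet")]

if __name__ == "__main__":
    for name, p in [("planner output", plan), ("alternative", safe_plan)]:
        print(name, "| valid:", is_valid(p, "bleach"), "| hazards:", find_hazards(p))
```

Both plans pass the feasibility check, but only the first is flagged. That is the gap the paper describes: a planner can score well on finishing the job while a separate safety check still finds a dangerous step.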