LLMs in robots: safety paper
- A joint Stanford/ETH paper found integrating LLMs into robots produces systematic safety risks across planning stages. (x.com)
- The authors show hazards can arise during perception, plan generation, and execution, where errors compound quickly. (x.com)
- They recommend stricter validation and runtime safeguards before deploying LLM‑driven planners in real environments. (x.com)
Large language models can write robot plans that look correct on paper but still produce dangerous actions in the real world. A new Stanford- and ETH Zurich-linked paper says those safety failures are systematic, not edge cases. (arxiv.org)

Large language models are the text-prediction systems behind chatbots; in robotics, researchers use them as high-level planners that turn commands like “put the knife away” into step-by-step actions. The paper, posted to arXiv on April 20, 2026, is by Tao Zhang, Kaixian Qu, Zhibin Li, Jiajun Wu, Marco Hutter, Manling Li, and Fan Shi. (arxiv.org)

The authors built a benchmark called DESPITE with 12,279 tasks covering physical dangers and rule-based, or normative, dangers. They evaluated 23 models and found the best planning model failed to produce a valid plan on just 0.4% of tasks but still produced dangerous plans on 28.3%. (arxiv.org)

The paper separates “can the model complete the task” from “can it avoid harm while doing it.” Across 18 open-source models ranging from 3 billion to 671 billion parameters, planning performance rose from 0.4% to 99.3%, while safety awareness stayed in a much narrower 38% to 57% band. (arxiv.org)

That gap matters because robot planners do not act directly on words; they act on a chain of guesses about the room, the objects, and the order of actions. The paper gives one example where an instruction to put down a knife near a child can pass a language-level safety check but still yield a dangerous action sequence if the knife remains accessible. (arxiv.org)

The authors argue that bigger models look safer mostly because they are better at finishing plans, not because they are better at recognizing danger. They describe planning ability and safety awareness as multiplicative, meaning weak safety awareness keeps showing up even when task competence improves. (arxiv.org)

The strongest results came from three proprietary reasoning models, which reached 71% to 81% safety awareness in the benchmark. The paper says non-reasoning proprietary models and open-source reasoning models stayed below 57%. (arxiv.org)

That finding lands as robotics groups push language models beyond chat windows and into embodied systems, where a bad plan can move a gripper, open a drawer, or approach a person. The paper says improving safety awareness, plus deterministic validation and runtime safeguards, is becoming a central deployment problem as planning performance nears saturation. (arxiv.org; openreview.net)

Other researchers are already building those guardrails. One recent OpenReview paper on RoboGuard reported cutting unsafe plans from 92% to under 2.5% in its setup by adding a two-stage safety architecture around LLM-enabled robots, rather than trusting the planner alone. (openreview.net)

The new paper’s bottom line is narrower and harder to wave away: a robot plan that sounds sensible is not the same thing as a robot plan that is safe. As language-model planners move closer to real homes, warehouses, and labs, the authors say safety has to be checked at the action level, not assumed from fluent text. (arxiv.org)
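
To make the idea of an action-level check concrete, here is a minimal Python sketch built around the knife example. It is an illustration only, not the paper's method or RoboGuard's architecture. The `Action` type, the `unsafe_steps` function, and the location and object names are all hypothetical, and a real system would need a far richer world model than a fixed set of reachable locations.

```python
from dataclasses import dataclass

# Hypothetical world knowledge, hard-coded for illustration only.
CHILD_REACHABLE = {"coffee_table", "floor", "low_shelf"}
HAZARDOUS_OBJECTS = {"knife"}


@dataclass(frozen=True)
class Action:
    verb: str           # e.g. "pick" or "place" (assumed action vocabulary)
    obj: str            # object the action manipulates
    location: str = ""  # target location, used by "place"


def unsafe_steps(plan, child_present):
    """Return indices of steps that place a hazardous object within a child's reach."""
    if not child_present:
        return []
    return [
        i
        for i, action in enumerate(plan)
        if action.verb == "place"
        and action.obj in HAZARDOUS_OBJECTS
        and action.location in CHILD_REACHABLE
    ]


# Both plans "put the knife down", so they read the same at the language level.
risky_plan = [Action("pick", "knife"), Action("place", "knife", "coffee_table")]
safe_plan = [Action("pick", "knife"), Action("place", "knife", "locked_drawer")]

print(unsafe_steps(risky_plan, child_present=True))  # [1]
print(unsafe_steps(safe_plan, child_present=True))   # []
```

The check is deterministic and inspects the concrete steps, not the instruction text, so it flags the plan that leaves the knife accessible regardless of how sensible the plan sounds.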