LLMs as planners, not controllers
Recent podcast and interview threads argue for a practical pattern in robot autonomy: treat large language or foundation models as high-level planners and semantic interfaces, while leaving low-level control to deterministic stacks and human supervisors. That hybrid architecture reduces real-world risk by keeping the physical closed loops, where mistakes are expensive, in tested control systems and using models for task decomposition, recovery, and human-readable intent. (x.com)
A robot can afford to think in sentences about where a wrench probably is. It cannot afford to think in sentences about exactly how hard to push a motor 200 times a second. (research.google) That split is the idea now showing up across robotics work: use a large language model for the “what next” layer, and keep the “move this joint now” layer in conventional control software. A 2024 paper called Plan-Seq-Learn describes that handoff as language for abstract planning, motion planning for bridging, and learned low-level control for execution. (arxiv.org)
Low-level control is the robot’s inner ear and reflexes. Boston Dynamics says Atlas combines perception with mobility and manipulation control so it can keep balance, place its feet, and handle heavy objects while moving through the world. (bostondynamics.com)
Large language models are better at a different job: turning a vague human request into a checklist. A paper called SMART-LLM uses them to convert high-level instructions into multi-robot task plans instead of asking them to directly run every actuator. (arxiv.org)
That matters because words are cheap and falls are expensive. Google’s Robotics Transformer 1 was designed to make real-time control feasible, but it still needed 130,000 episodes across more than 700 tasks collected over 17 months from 13 robots to learn that action layer. (research.google)
The newer systems are getting stronger at the planning side first. Google DeepMind’s AutoRT, SARA-RT, and RT-Trajectory were presented as tools to help robots choose actions faster and understand navigation and trajectories better, which is closer to decision support than replacing every control loop in a factory robot. (deepmind.google)
When researchers do let language models into the loop, they usually add guardrails. The CoPAL system uses a large language model for corrective planning after failure feedback, while leaving low-level motion planning and execution outside the model. (arxiv.org)
Industry messaging has started to sound the same. NVIDIA’s Isaac GR00T page says its stack combines robot foundation models with simulation, synthetic data pipelines, and an onboard computer, which is a full system architecture, not a chatbot bolted onto a robot arm. (developer.nvidia.com)
NVIDIA is also pushing world foundation models that predict future world states in simulation. That is useful for rehearsal, data generation, and recovery planning before a robot touches a shelf, a pallet, or a person. (developer.nvidia.com)
Boston Dynamics and Toyota Research Institute said this week they are building language-conditioned Large Behavior Models for Atlas on long-horizon manipulation tasks. Even there, the company still describes Atlas as an industrial machine built on decades of control and real-world deployment experience, not as a pure end-to-end language model robot. (bostondynamics.com)
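The planner/controller split described above can be pictured with a toy sketch. It is not taken from any of the systems cited here. Every name in it (`plan`, `validate`, `run_step`, `ALLOWED_SKILLS`, the gains and joint limits) is a hypothetical stand-in, the "planner" is a hard-coded stub where a real system would call a model, and the "robot" is a single simulated joint with unit inertia. The only figure borrowed from the text is the 200-cycles-per-second control rate, which works out to 5 milliseconds per cycle.

```python
from dataclasses import dataclass
from typing import List, Optional

CONTROL_HZ = 200            # control rate used as the example in the text
DT = 1.0 / CONTROL_HZ       # 0.005 s, i.e. 5 ms per control cycle

ALLOWED_SKILLS = {"move_joint"}    # hypothetical skill vocabulary
JOINT_LIMITS = (-1.0, 1.0)         # hypothetical joint range, in radians


@dataclass
class Step:
    skill: str
    target: float  # desired joint angle in radians


def plan(instruction: str, feedback: Optional[str] = None) -> List[Step]:
    """Stand-in for the language-model planner: symbolic steps only.

    A real system would send the instruction (and any failure feedback)
    to a model. This stub ignores both and returns a fixed plan.
    """
    return [Step("move_joint", 0.5)]


def validate(steps: List[Step]) -> bool:
    """Guardrail: accept only known skills with targets inside joint limits."""
    low, high = JOINT_LIMITS
    return all(s.skill in ALLOWED_SKILLS and low <= s.target <= high for s in steps)


def run_step(step: Step, angle: float, velocity: float,
             kp: float = 40.0, kd: float = 12.0, max_cycles: int = 2000):
    """Deterministic PD loop at CONTROL_HZ. No model calls happen in here."""
    for _ in range(max_cycles):
        torque = kp * (step.target - angle) - kd * velocity
        velocity += torque * DT    # unit inertia: acceleration equals torque
        angle += velocity * DT
        if abs(step.target - angle) < 1e-3 and abs(velocity) < 1e-3:
            return angle, velocity, True
    return angle, velocity, False


def execute(instruction: str, max_replans: int = 1) -> float:
    """Plan in words, check the plan, then hand each step to the controller."""
    angle, velocity = 0.0, 0.0
    feedback = None
    for _ in range(max_replans + 1):
        steps = plan(instruction, feedback)
        if not validate(steps):
            raise ValueError("planner proposed a step outside the guardrails")
        for step in steps:
            angle, velocity, ok = run_step(step, angle, velocity)
            if not ok:
                # Failure goes back to the planner as text, not to the motors.
                feedback = f"{step.skill} to {step.target} did not settle"
                break
        else:
            return angle
    raise RuntimeError("gave up after replanning")


if __name__ == "__main__":
    print(execute("raise the arm halfway"))
```

The design point is where the boundaries sit: the planner is consulted once per task or once per failure, the guardrail rejects anything outside a fixed vocabulary before it reaches hardware, and the loop that runs every 5 ms contains only arithmetic.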