Spot + Gemini VLM demo
Boston Dynamics posted a demo of Spot running with Google DeepMind’s Gemini Robotics-ER vision-language model (VLM) to perform embodied reasoning and tidy a living room. (x.com) The post shows a VLM driving perception and instruction following on a mobile manipulator platform. (x.com)
Robots that work in homes need two layers of software: one to see and describe a room, and another to turn that description into safe movements. Boston Dynamics said a new demo pairs its Spot robot with Google DeepMind’s Gemini Robotics-ER 1.5 model to do both in a living room cleanup task. (bostondynamics.com)

Boston Dynamics posted the demo in April 2026 and said Spot picked up shoes and soda cans in a residential home. The company said the project grew out of a 2025 internal hackathon and used conversational prompts instead of a hand-coded state machine for every step. (bostondynamics.com)

Gemini Robotics-ER is a vision-language model, which means it takes images and text and returns structured reasoning about what is in view and what to do next. Google’s developer documentation says the preview model can identify objects, reason about spatial relationships, break a natural-language command into subtasks, and pass those results to existing robot controllers. (ai.google.dev)

In the Spot demo, Boston Dynamics said it built a software layer between Gemini and Spot’s application programming interface. That layer exposed a limited tool set of five actions (navigate, capture images, identify objects, grasp, and place) so the model could call scripts that translated its decisions into robot actions. (bostondynamics.com)

Google DeepMind introduced Gemini Robotics and Gemini Robotics-ER on March 12, 2025, describing them as Gemini 2.0-based models for physical machines. DeepMind said Gemini Robotics is the version that directly outputs robot actions, while Gemini Robotics-ER focuses on embodied reasoning so developers can connect the model to their own control software. (deepmind.google) That split helps explain the Boston Dynamics video: Spot is not being run by a single end-to-end model that directly drives every joint.
Boston Dynamics said the model handled high-level perception and instruction following, while Spot’s own software development kit and application programming interface executed the bounded robot behaviors. (bostondynamics.com)

Spot itself is a four-legged mobile robot that Boston Dynamics sells mainly for inspection, remote sensing, hazardous response, and research. The company says Spot has 14 kilograms of payload capacity, can follow predefined routes autonomously, and is already deployed in more than 1,500 customer systems. (bostondynamics.com)

Boston Dynamics’ support documentation says Spot can be teleoperated or run with onboard perception and guidance, and it can carry different sensors and payloads for specific jobs. In the cleanup demo, that hardware base was paired with a robotic arm and cameras so the language model could inspect scenes before choosing a grasp or a destination. (support.bostondynamics.com, bostondynamics.com)

Google’s robotics documentation also includes a warning that generative models can make mistakes and that developers remain responsible for maintaining a safe environment around the robot. Boston Dynamics said it narrowed the model’s available actions to a finite tool list, a common way to keep a language model from issuing open-ended commands to physical hardware. (ai.google.dev, bostondynamics.com)

The living-room scene is a small test compared with Spot’s usual factory and plant work, but it shows the current pattern in robotics software: foundation models do the scene reading and task planning, and the robot’s native controls do the walking, grasping, and recovery. That is the setup Boston Dynamics put on display when Spot cleaned up a room one object at a time. (bostondynamics.com, deepmind.google)
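
The finite tool list described above can be illustrated with a short Python sketch. It is not Boston Dynamics’ code and it calls neither the Spot SDK nor the Gemini API. The tool names follow the five actions named in the article, while `ToolCall`, `ToolLayer`, `build_stub_layer`, `run_plan`, and the argument names are hypothetical, and the handlers are stubs that only log what they were asked to do. The point is the shape of the layer: a planner may request any action, but only allowlisted ones reach a handler.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

# Hypothetical tool names, mirroring the five actions the article lists.
ALLOWED_TOOLS = ("navigate", "capture_image", "identify_objects", "grasp", "place")


@dataclass
class ToolCall:
    """One action requested by the planner (the model's role in the demo)."""
    name: str
    args: Dict[str, Any]


class ToolLayer:
    """Maps a fixed set of tool names to handlers and rejects everything else."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[..., str]] = {}

    def register(self, name: str, handler: Callable[..., str]) -> None:
        if name not in ALLOWED_TOOLS:
            raise ValueError(f"{name!r} is not in the allowed tool list")
        self._handlers[name] = handler

    def dispatch(self, call: ToolCall) -> str:
        handler = self._handlers.get(call.name)
        if handler is None:
            # Unknown or unregistered tools never reach a handler.
            return f"rejected: {call.name}"
        return handler(**call.args)


def build_stub_layer(log: List[str]) -> ToolLayer:
    """Wire each allowed tool to a stub that only records the request."""
    layer = ToolLayer()

    def make_stub(tool_name: str) -> Callable[..., str]:
        def stub(**kwargs: Any) -> str:
            log.append(f"{tool_name}({kwargs})")
            return f"ok: {tool_name}"
        return stub

    for tool_name in ALLOWED_TOOLS:
        layer.register(tool_name, make_stub(tool_name))
    return layer


def run_plan(layer: ToolLayer, plan: List[ToolCall]) -> List[str]:
    """Execute calls in order, stopping at the first rejected one."""
    results: List[str] = []
    for call in plan:
        result = layer.dispatch(call)
        results.append(result)
        if result.startswith("rejected"):
            break
    return results


if __name__ == "__main__":
    log: List[str] = []
    layer = build_stub_layer(log)
    plan = [
        ToolCall("navigate", {"target": "sofa"}),
        ToolCall("grasp", {"object_id": "shoe_1"}),
        ToolCall("set_joint_torque", {"joint": 3, "value": 9.0}),  # not allowed
        ToolCall("place", {"target": "shoe_rack"}),
    ]
    print(run_plan(layer, plan))
    # prints ['ok: navigate', 'ok: grasp', 'rejected: set_joint_torque']
```

In this sketch the plan halts at the first rejected call instead of skipping it, on the assumption that a request outside the tool list means the plan as a whole should be re-examined. A real integration would replace the stubs with calls into the robot’s own control software and would need its own safety checks.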