Visual AI and robotics heating up
Google DeepMind is partnering with Agile Robots to combine robotics hardware with Gemini‑based visual models to speed deployment and iteration, showing practical interest in closing the loop from data collection to model training. Meanwhile, ex‑DeepMind researchers launched Elorian with $55 million to push visual‑AI innovation, underlining startup momentum in domain‑specific visual systems.
A robot arm is only as smart as what it can see, and that has been the weak spot: most artificial intelligence systems can describe an image, but factory robots still struggle when a part is tilted, hidden, or sitting in the wrong bin. Google DeepMind’s answer is to pair its Gemini Robotics models with Agile Robots’ machines so the software learns from the messiness of real factories instead of polished demos. (agile-robots.com)
This field is called robotics vision, and the basic job is simple to say but hard to do: turn camera pixels into a physical action like “pick up the blue connector” without crushing it or missing it. Google DeepMind says Gemini Robotics is built to bring multimodal reasoning and world understanding into the physical world so robots can perceive, plan, and act. (deepmind.google)
The newer step is a vision-language-action model, which works like a translator between three things that usually live in separate boxes: what the robot sees, what a person asks for, and how the robot moves its joints. Google DeepMind’s research paper says Gemini Robotics is a generalist model that can directly control robots rather than just talk about what they should do. (arxiv.org)
That only gets useful when the model is connected to hardware that actually works on a shop floor. Agile Robots said on March 24, 2026 that it will integrate Gemini Robotics foundation models with its own hardware in a strategic research partnership aimed at industrial environments. (agile-robots.com)
Agile Robots is not a small lab project trying to find a first customer. CNBC reported the company builds sensor-based robotic arms and humanoid robots, which gives Google DeepMind a path to collect data from real tasks instead of waiting for outside partners to send it back months later. (cnbc.com) That data loop is the whole game in physical artificial intelligence.
Automation World reported the partnership will focus on capturing data from real-world robotic operations to improve Google DeepMind’s models, which means every failed grasp or awkward movement can become training material for the next update. (automationworld.com)
Google DeepMind is not betting on just one robot company either. TechCrunch reported Agile Robots is the latest in a string of robotics companies partnering with Google DeepMind, which suggests the lab wants its models to run across many bodies the way an operating system runs across many laptops. (techcrunch.com)
At the same time, former DeepMind researchers are trying to fix the same bottleneck from the startup side. Elorian came out of stealth this week with $55 million in funding to build models that reason about images and other visual data, according to Bloomberg and the company’s own site. (bloomberg.com) (elorian.ai)
Elorian’s pitch is that today’s big models still lean too heavily on text, even when the real problem is visual. Its website says industries like science, engineering, and robotics need tools grounded in visual reasoning “from the very beginning,” which is a direct shot at systems that can chat fluently but still misread diagrams, scenes, and physical layouts. (elorian.ai)
Bloomberg reported Elorian was launched by former Google DeepMind researcher Andrew Dai, has hired more than a dozen people, and is targeting its first public reasoning model within about 12 months. The same report said the startup emerged at a $300 million valuation, with backing that included Menlo Ventures, Altimeter Capital, Striker Venture Partners, and Nvidia. (bloomberg.com)
Put those two moves together and the pattern is pretty clear: one camp is wiring visual models into robot arms to gather better real-world data, and another is raising fresh money to build better visual reasoning itself.
The race is shifting from chat windows to cameras, grippers, and the hard problem of getting machines to understand what is in front of them before they touch it. (agile-robots.com) (elorian.ai)