New Research Paper Unveils 'HERO' Humanoid Model

A new AI research paper introduces the HERO paradigm for humanoid robotics. The model enables open-vocabulary loco-manipulation, meaning a robot can understand and act on a wide range of commands, by leveraging large-scale vision models. This approach could significantly accelerate the development of general-purpose humanoid capabilities.

The HERO paper, published around February 18-19, 2026, presents a modular system that sidesteps the massive data requirements of traditional imitation learning. Instead of relying on extensive real-world training data, it combines the generalization capabilities of large vision models with robust control policies trained in simulation. This hybrid approach is a significant departure from end-to-end learning methods, which often struggle with precision and adaptability in novel scenarios.

At its core, HERO tackles the critical challenge of end-effector (EE) tracking accuracy, a major limitation in prior humanoid manipulation attempts, where errors of 8-13 cm were common. The new system reduces this tracking error by a factor of 3.2, achieving a precision of approximately 2.2-2.5 cm. This is accomplished through a learned policy that uses inverse kinematics, a neural forward model, and continuous replanning to correct for deviations and accurately control the robot's hands.

The system's perception stack leverages powerful, pre-trained vision models like Grounding DINO 1.5 and SAM-3 to identify and segment objects based on natural language commands. This allows for "open-vocabulary" interaction, where the robot isn't limited to a pre-defined list of objects. In real-world tests, this architecture achieved a 90% success rate in grasping various objects on surfaces ranging from 43 cm to 92 cm in height, demonstrating its ability to adapt its whole body to the task.

This advance in loco-manipulation comes as the humanoid robotics field is seeing a push towards real-world deployment. Companies like Tesla with Optimus Gen 2, 1X with NEO, and Boston Dynamics with its new all-electric Atlas are all targeting industrial and commercial applications. The development of general-purpose humanoids that can operate in human-centric environments is a shared goal, and analysts had expected mass production of some models to begin in 2025.

The HERO framework contributes to a broader trend of Vision-Language-Action (VLA) models transforming robotics. These models bridge the gap between high-level instructions and low-level motor control, enabling robots to understand and execute complex tasks in unstructured environments. This shift from rigid, pre-programmed systems to adaptable, learning-based agents is critical for both commercial automation and potential defense applications where robots must operate in dynamic and unpredictable settings.
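
To make the end-effector tracking idea more concrete, here is a minimal toy sketch of how inverse kinematics, a forward model, and repeated replanning can work together. It is not the paper's implementation: the two-link planar arm, the link lengths, the joint offsets, the target, and every function name are assumptions made up for illustration, and the "forward model" is a hand-written stand-in rather than a trained neural network.

```python
import math

# All numbers below are hypothetical and chosen only for illustration.
L1, L2 = 0.30, 0.25            # link lengths of a toy 2-link planar arm (m)
TRUE_OFFSET = (0.08, -0.05)    # joint error the toy "real robot" actually has (rad)
MODEL_OFFSET = (0.07, -0.04)   # what the imperfect forward model believes (rad)
TARGET = (0.35, 0.20)          # desired hand position (m)


def forward_kinematics(q1, q2):
    """Hand position for joint angles (q1, q2)."""
    x = L1 * math.cos(q1) + L2 * math.cos(q1 + q2)
    y = L1 * math.sin(q1) + L2 * math.sin(q1 + q2)
    return (x, y)


def inverse_kinematics(x, y):
    """Analytic IK for the 2-link arm (one elbow branch, clamped to reach)."""
    c2 = (x * x + y * y - L1 * L1 - L2 * L2) / (2 * L1 * L2)
    q2 = math.acos(max(-1.0, min(1.0, c2)))
    q1 = math.atan2(y, x) - math.atan2(L2 * math.sin(q2), L1 + L2 * math.cos(q2))
    return (q1, q2)


def execute_on_robot(q1, q2):
    """Toy stand-in for hardware: commanded joints are reached with an offset."""
    return forward_kinematics(q1 + TRUE_OFFSET[0], q2 + TRUE_OFFSET[1])


def forward_model(q1, q2):
    """Stand-in for a learned forward model: predicts where the hand will land."""
    return forward_kinematics(q1 + MODEL_OFFSET[0], q2 + MODEL_OFFSET[1])


def reach(target, replans):
    """Shift the commanded point by the predicted miss, re-solving IK each time."""
    command = target
    for _ in range(replans):
        predicted = forward_model(*inverse_kinematics(*command))
        command = (command[0] + target[0] - predicted[0],
                   command[1] + target[1] - predicted[1])
    return execute_on_robot(*inverse_kinematics(*command))


naive_error = math.dist(reach(TARGET, 0), TARGET)
replanned_error = math.dist(reach(TARGET, 5), TARGET)
print(f"naive IK error:  {naive_error * 100:.2f} cm")
print(f"with replanning: {replanned_error * 100:.2f} cm")
```

In this toy setup the replanned reach lands several times closer to the target than plain IK. The error that remains comes from the gap between MODEL_OFFSET and TRUE_OFFSET, which hints at why the accuracy of the forward model matters in a scheme like this.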
