NVIDIA Model Enables Humanoid Teleoperation from Videos

NVIDIA has introduced EgoScale, a new model that advances humanoid robot teleoperation by pre-training on egocentric videos of human actions. The model can then be fine-tuned with minimal robot-specific data to enable control of different robotic hands. The technique has been successfully demonstrated on hardware from Sharpa and the Unitree G1, showcasing a new method for skill transfer to dexterous manipulators.

EgoScale is a key component of NVIDIA's broader Project GR00T (Generalist Robot 00 Technology), an initiative to create a general-purpose foundation model for humanoid robots. EgoScale uses a Vision-Language-Action (VLA) architecture similar to GR00T N1, designed to understand and execute tasks from multimodal inputs. This positions EgoScale as an attempt to solve the "hand problem", one of the most significant challenges in robotics, by enabling fine-grained manipulation.

The model's pre-training on 20,854 hours of egocentric human video is a departure from traditional teleoperation methods, which often rely on direct human control via exoskeletons or joysticks. Exoskeleton-based systems create a direct physical mapping between the operator and the robot, but they can be complex to operate and require intricate hardware. EgoScale's video-based approach bypasses this by learning from a massive, diverse dataset of human actions, aiming for a more scalable and generalizable solution.

A significant outcome of the EgoScale framework is the emergence of one-shot task adaptation. After the initial large-scale pre-training on human videos and a mid-training stage with some aligned human-robot data, the model can learn a completely new, complex task, such as folding a shirt, from a single robot demonstration. This suggests the model is not just mimicking but has developed a foundational understanding of motor skills that can be quickly reapplied.

The entire development and training pipeline is heavily accelerated by NVIDIA's simulation platforms. NVIDIA Isaac Lab, built on the Isaac Sim simulator, supports massively parallel simulation, so policies can be trained across thousands of robot instances on a single GPU. This "sim-to-real" approach is critical for safely and rapidly developing and validating complex behaviors before deploying them on physical hardware.

The hardware demonstrations on the SharpaWave hand and the Unitree G1 humanoid are notable. The SharpaWave is a high-dexterity, 22-degree-of-freedom robotic hand designed to replicate human-like manipulation. The Unitree G1 is a versatile and agile humanoid robot that provides a full-body platform for testing the integration of these advanced manipulation skills. Running on such sophisticated hardware showcases the model's ability to control complex, high-DoF systems.

Using foundation models trained on vast datasets of human behavior represents a significant trend in embodied AI. By leveraging human data, companies like NVIDIA aim to overcome the bottleneck of collecting massive amounts of robot-specific data, which is often slow and expensive. The same strategy is being adopted by numerous players in the humanoid robotics space, who increasingly rely on simulation and foundation models to accelerate development.
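
To make the idea of massively parallel simulation more concrete, here is a minimal toy sketch of stepping many independent robot instances in lockstep under one shared control law. It does not use the Isaac Lab or Isaac Sim APIs: the `BatchedJointSim` class, the `pd_policy` function, and all parameter values are hypothetical, each "robot" is a single joint with unit inertia, and plain Python lists stand in for the GPU tensors a real simulator would use.

```python
from dataclasses import dataclass


@dataclass
class BatchedJointSim:
    """Toy stand-in for a vectorized simulator: N independent 1-DoF joints."""
    positions: list
    velocities: list
    dt: float = 0.01  # hypothetical time step, in seconds

    def step(self, torques):
        # Unit inertia, so acceleration equals torque.
        # Semi-implicit Euler: update velocity first, then position.
        for i, tau in enumerate(torques):
            self.velocities[i] += tau * self.dt
            self.positions[i] += self.velocities[i] * self.dt


def pd_policy(positions, velocities, targets, kp=40.0, kd=12.0):
    """One shared PD control law applied to every instance at once."""
    return [kp * (t - p) - kd * v
            for p, v, t in zip(positions, velocities, targets)]


def rollout(num_envs=1024, steps=500):
    """Step all instances in lockstep; return the worst final tracking error."""
    # Give each instance a different target so the batch is not trivial.
    targets = [(i % 10) / 10.0 for i in range(num_envs)]
    sim = BatchedJointSim([0.0] * num_envs, [0.0] * num_envs)
    for _ in range(steps):
        torques = pd_policy(sim.positions, sim.velocities, targets)
        sim.step(torques)
    return max(abs(p - t) for p, t in zip(sim.positions, targets))


if __name__ == "__main__":
    print(f"worst tracking error across instances: {rollout():.2e}")
```

The point of the sketch is the shape of the loop: one policy call and one simulator step advance every instance together, which is what allows a GPU-backed simulator to spread the work across thousands of robots.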
