NVIDIA and UT Austin Open-Source Humanoid Model

Published by The Daily Scout

What happened

Researchers from UT Austin and NVIDIA have open-sourced SONIC, a behavior foundation model for humanoid robots. The model enables real-time, whole-body motion for complex loco-manipulation tasks. It supports both teleoperation and Vision-Language-Action (VLA) inference, aiming to accelerate the real-world deployment of generalist humanoid robots.

Why it matters

- The model, named SONIC, was developed by scaling up three key areas: network size (to 42 million parameters), dataset volume (over 700 hours of motion data), and compute (9,000 GPU hours). This large-scale training on human motion-capture data allows the model to learn "human motion priors" without the need for manual reward engineering for each specific skill.
- The research is led by UT Austin Associate Professor Dr. Yuke Zhu, who also co-leads NVIDIA's Generalist Embodied Agent Research (GEAR) group, the team behind the project. His work focuses on the intersection of robotics, computer vision, and machine learning to create general-purpose autonomous robots.
- SONIC is designed as a "System 1" controller, providing fast, reactive whole-body motor skills, analogous to human reflexes. It can be paired with a "System 2" model, like NVIDIA's GR00T N1.5 Vision-Language-Action (VLA) model, which handles slower, high-level reasoning and planning to decide what tasks to perform.
- A core technical innovation is its "universal token space," which allows a single, unified policy to be controlled by diverse inputs without retraining. This enables real-time control from VR teleoperation rigs, human video feeds, text commands, and even music.
- To bridge the gap between the learned model and interactive control, SONIC includes a real-time kinematic planner. This allows an operator to use a gamepad or keyboard to guide the robot's locomotion with various styles like running, sneaking, or crawling, with the planner regenerating motions in under 5 milliseconds.
- The model's capabilities were demonstrated on the Unitree G1 humanoid robot, showcasing a direct path from simulation to real-world hardware deployment.
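
For readers who think in code, the "System 1" / "System 2" split and the shared token space described above can be pictured with a toy loop. This is a minimal sketch under stated assumptions, not SONIC's actual code or API: every name here (Token, TOKEN_DIM, encode_gamepad, encode_text, fast_policy, run_loop) is hypothetical, and the hand-written encoders and policy stand in for what would be learned networks.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

TOKEN_DIM = 4  # hypothetical size of the shared command token


@dataclass
class Token:
    """A fixed-size command vector; every input source maps to this shape."""
    values: List[float]

    def __post_init__(self) -> None:
        if len(self.values) != TOKEN_DIM:
            raise ValueError(f"token must have {TOKEN_DIM} values")


# Hypothetical encoders: different inputs, same output shape.
def encode_gamepad(stick_x: float, stick_y: float) -> Token:
    return Token([stick_x, stick_y, 0.0, 0.0])


def encode_text(command: str) -> Token:
    styles = {"run": 1.0, "sneak": 0.5, "crawl": 0.25}  # made-up style codes
    return Token([0.0, 0.0, styles.get(command, 0.0), 1.0])


def fast_policy(token: Token, joints: List[float]) -> List[float]:
    """Stand-in for the fast controller: one policy for any token source."""
    gain = sum(token.values) / TOKEN_DIM
    return [q + 0.1 * gain for q in joints]


def run_loop(
    planner: Callable[[int], Token], steps: int, replan_every: int
) -> Tuple[List[float], int]:
    """Run the fast policy every step; ask the slow planner only occasionally."""
    joints = [0.0, 0.0]
    token = planner(0)  # initial high-level command
    replans = 1
    for step in range(steps):
        if step > 0 and step % replan_every == 0:
            token = planner(step)  # slow, deliberate update
            replans += 1
        joints = fast_policy(token, joints)  # fast, reactive update
    return joints, replans


if __name__ == "__main__":
    final_joints, n_replans = run_loop(lambda s: encode_text("run"), 10, 5)
    print(final_joints, n_replans)
```

The point of the sketch is only the shape of the design: the fast policy never needs to know whether its token came from a gamepad or a text command, and the slow planner runs far less often than the controller it steers.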

Key numbers

  • Network size: 42 million parameters
  • Dataset volume: over 700 hours of motion data
  • Compute: 9,000 GPU hours
  • Kinematic planner: regenerates motions in under 5 milliseconds

