New Open-Source Bimanual Robot Model

A new 1-billion-parameter diffusion transformer, RDT-1B, has been released for bimanual robot manipulation. The model was pretrained on over one million multi-robot episodes and is seen as an example of how foundation models can scale imitation learning. RDT-1B is available as an open-source project on GitHub for researchers and developers.

- To address data scarcity in bimanual manipulation, RDT-1B was fine-tuned on a new dataset of over 6,000 episodes collected on an ALOHA dual-arm robot.
- A key technical innovation is its "Physically Interpretable Unified Action Space," which standardizes action representations from the 46 different robot datasets it was pre-trained on, preserving physical meanings to better transfer learned skills.
- The model's architecture combines a SigLIP vision encoder and a T5-XXL language encoder with a diffusion transformer that generates actions, allowing it to understand multi-modal commands.
- In practice, RDT-1B processes language instructions and RGB images from up to three camera views to predict a sequence of 64 future robot actions.
- It has demonstrated strong few-shot learning capabilities, acquiring new skills from as few as one to five demonstrations on a real robot.
- This model follows a trend of Vision-Language-Action (VLA) models in robotics, such as Google DeepMind's RT-2, which also learns from a combination of web-scale data and robot-specific data to improve generalization.
- RDT-1B's performance is benchmarked on its dexterity, zero-shot generalization to unseen objects, and its ability to follow language instructions for complex tasks.
- The model is designed to be compatible with a wide variety of manipulators, including single-arm, dual-arm, and robots with wheeled locomotion, by unifying their action spaces.
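
The idea behind a unified action space can be illustrated with a small sketch. This is not RDT-1B's actual code or layout: the slot names, index ranges, and the 17-dimensional size below are hypothetical, chosen only to show how robots with different hardware could write their actions into one fixed-length vector in which each index keeps a single physical meaning, with a mask marking which entries a given robot actually provides.

```python
from typing import Dict, List, Tuple

# Hypothetical layout (an assumption for illustration, not RDT-1B's real one):
# every physical quantity owns a fixed index range in the unified vector.
UNIFIED_SLOTS: Dict[str, Tuple[int, int]] = {
    "right_arm_joint_pos": (0, 6),
    "right_gripper_open": (6, 7),
    "left_arm_joint_pos": (7, 13),
    "left_gripper_open": (13, 14),
    "base_linear_vel": (14, 16),
    "base_angular_vel": (16, 17),
}
UNIFIED_DIM = 17


def to_unified(action: Dict[str, List[float]]) -> Tuple[List[float], List[int]]:
    """Place a robot-specific action into the unified vector.

    Returns the padded vector and a 0/1 mask of the entries that were filled.
    """
    vec = [0.0] * UNIFIED_DIM
    mask = [0] * UNIFIED_DIM
    for name, values in action.items():
        if name not in UNIFIED_SLOTS:
            raise KeyError(f"unknown action component: {name}")
        start, end = UNIFIED_SLOTS[name]
        if len(values) != end - start:
            raise ValueError(f"{name} expects {end - start} values, got {len(values)}")
        vec[start:end] = [float(v) for v in values]
        mask[start:end] = [1] * (end - start)
    return vec, mask


def from_unified(vec: List[float], names: List[str]) -> Dict[str, List[float]]:
    """Read back only the components a particular robot supports."""
    return {name: vec[UNIFIED_SLOTS[name][0]:UNIFIED_SLOTS[name][1]] for name in names}


# A single-arm robot and a wheeled dual-arm robot share the same vector format.
single_arm = {"right_arm_joint_pos": [0.1] * 6, "right_gripper_open": [1.0]}
mobile_dual_arm = {
    "right_arm_joint_pos": [0.0] * 6,
    "right_gripper_open": [0.0],
    "left_arm_joint_pos": [0.2] * 6,
    "left_gripper_open": [1.0],
    "base_linear_vel": [0.5, 0.0],
    "base_angular_vel": [0.1],
}
single_vec, single_mask = to_unified(single_arm)
mobile_vec, mobile_mask = to_unified(mobile_dual_arm)
```

Because the right gripper always lives at the same index in this sketch, a value learned from one robot's data refers to the same physical quantity on another, which is the property the briefing describes as preserving physical meanings. Under the same assumptions, a predicted sequence of 64 future actions would simply be a list of 64 such vectors.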
