New Open-Source Bimanual Robot Model
RDT-1B, a new 1-billion-parameter diffusion transformer for bimanual robot manipulation, has been released. The model was pretrained on over one million multi-robot episodes and illustrates how foundation models can scale imitation learning. It is available to researchers and developers as an open-source project on GitHub.
- To address data scarcity in bimanual manipulation, RDT-1B was fine-tuned on a new dataset of over 6,000 episodes collected on an ALOHA dual-arm robot.
- A key technical innovation is its "Physically Interpretable Unified Action Space," which standardizes action representations from the 46 different robot datasets it was pre-trained on, preserving physical meanings to better transfer learned skills.
- The model's architecture combines a SigLIP vision encoder and a T5-XXL language encoder with a diffusion transformer that generates actions, allowing it to understand multi-modal commands.
- In practice, RDT-1B processes language instructions and RGB images from up to three camera views to predict a sequence of 64 future robot actions.
- It has demonstrated strong few-shot learning capabilities, acquiring new skills from as few as one to five demonstrations on a real robot.
- This model follows a trend of Vision-Language-Action (VLA) models in robotics, such as Google DeepMind's RT-2, which also learns from a combination of web-scale data and robot-specific data to improve generalization.
- RDT-1B's performance is benchmarked on its dexterity, zero-shot generalization to unseen objects, and its ability to follow language instructions for complex tasks.
- The model is designed to be compatible with a wide variety of manipulators, including single-arm, dual-arm, and robots with wheeled locomotion, by unifying their action spaces.
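
The idea behind a unified action space can be illustrated with a small sketch: every physical quantity (joint positions, gripper state, base velocity) gets a fixed slot in one shared vector, and a robot fills only the slots it actually has. The slot names, their sizes, and the helpers `to_unified` and `from_unified` below are hypothetical and chosen for illustration; they are not RDT-1B's actual layout or API.

```python
from typing import Dict, List, Tuple

# Hypothetical slot layout: each physical quantity owns a fixed index range.
# Names and sizes are illustrative assumptions, not RDT-1B's real layout.
SLOTS: Dict[str, Tuple[int, int]] = {
    "right_arm_joint_pos": (0, 7),
    "right_gripper_open": (7, 8),
    "left_arm_joint_pos": (8, 15),
    "left_gripper_open": (15, 16),
    "base_linear_vel": (16, 18),
    "base_angular_vel": (18, 19),
}
UNIFIED_DIM = 19


def to_unified(action: Dict[str, List[float]]) -> Tuple[List[float], List[int]]:
    """Place a robot-specific action into the shared vector.

    Returns (vector, mask). mask[i] is 1 where the robot supplied a value
    and 0 where the slot was left as zero padding.
    """
    vector = [0.0] * UNIFIED_DIM
    mask = [0] * UNIFIED_DIM
    for name, values in action.items():
        if name not in SLOTS:
            raise KeyError(f"unknown action field: {name}")
        start, end = SLOTS[name]
        if len(values) != end - start:
            raise ValueError(
                f"{name} expects {end - start} values, got {len(values)}"
            )
        vector[start:end] = values
        mask[start:end] = [1] * (end - start)
    return vector, mask


def from_unified(vector: List[float], fields: List[str]) -> Dict[str, List[float]]:
    """Read back only the fields a given robot can execute."""
    return {name: vector[SLOTS[name][0]:SLOTS[name][1]] for name in fields}


# A single-arm robot fills only the right-arm slots; the rest stay padded.
single_arm = {"right_arm_joint_pos": [0.1] * 7, "right_gripper_open": [1.0]}
vec, mask = to_unified(single_arm)
```

Because a slot always carries the same physical meaning regardless of which robot produced it, data from single-arm, dual-arm, and mobile platforms can share one training target, and the mask records which entries are real and which are padding.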