Open VLA tops π0.5 on 7 tests

- Ai2 released MolmoAct 2 on May 5, saying its open robot model beat strong baselines including Physical Intelligence’s π0.5 across seven benchmarks.
- The bigger flex is the backbone: Molmo2-ER beat GPT-5 and Gemini Robotics ER-1.5 across 13 embodied-reasoning tests, not just robot control.
- That matters because open robot stacks usually trail closed systems on real-world manipulation, data access, and latency.

Robotics models are supposed to do three jobs at once. They have to see a scene, reason about what matters in 3D, and then turn that into motor commands fast enough for a real robot to use. That last part is where a lot of flashy demos fall apart. Ai2 is claiming a real step forward here: its new open model family, MolmoAct 2, beat strong baselines including Physical Intelligence’s π0.5 across seven simulation and real-world benchmarks, and Ai2 released the weights, code, and training data with it.

### What exactly launched?

MolmoAct 2 is Ai2’s new open vision-language-action (VLA) stack for robots. The core idea is simple enough: start with a model tuned for embodied reasoning, then add robot state and action modeling so the system can actually control hardware in a closed loop. Ai2 published the project on May 5, 2026, alongside code, checkpoints, and datasets for researchers to fine-tune or deploy on common robot setups. (allenai.org)

### What is the headline result?

The paper’s headline is not “we made a robot model.” It’s “our open one now beats a very credible baseline.” Ai2 says MolmoAct 2 outperformed strong baselines including π0.5 in what it calls the most extensive empirical study yet for an open VLA, covering seven simulation and real-world benchmarks. That matters because π0.5 is not a toy comparison: it is Physical Intelligence’s upgraded open-world generalization model built on more than 10,000 hours of robot data. (allenai.org)

### Why does the backbone matter so much?

Because the robot policy is only as good as the model doing the spatial thinking upstream. Ai2 built MolmoAct 2 on Molmo2-ER, an embodied-reasoning vision-language backbone trained on a 3.3M-sample corpus. Ai2 says that backbone beat GPT-5 and Gemini Robotics ER-1.5 across 13 embodied-reasoning benchmarks. In short, the claim is that better “think in space before acting” performance is translating into better robot behavior, not just better benchmark trivia.
(arxiv.org)

### What did Ai2 change under the hood?

A few things at once. The stack adds an open action tokenizer called OpenFAST, uses a flow-matching continuous-action expert for manipulation, and includes a “MolmoThink” variant that re-predicts depth only where the scene changes. That last trick is the practical one: it tries to keep 3D grounding without paying the full latency cost every step. Ai2 also says MolmoAct 2 runs up to 37x faster than the original MolmoAct. (arxiv.org)

### Why is openness part of the story?

Because robotics results are notoriously hard to reproduce. Teams often release a paper and maybe some weights, but not the full recipe. Ai2 is pushing the opposite line here: full model weights, training code, and complete training data. It also released a new bimanual dataset with more than 720 hours of teleoperated demonstrations, which Ai2 describes as the largest open bimanual tabletop manipulation dataset published so far. (arxiv.org)

### Does this mean open models now lead robotics?

Not exactly. The catch is that benchmark wins do not mean universal deployment wins. Ai2 itself describes these as foundation checkpoints, not one-size-fits-all policies, and notes that performance still depends on hardware, cameras, calibration, and task distribution. But the gap does look narrower than it used to. Open models are no longer just cheaper copies; they are starting to post results that force comparison with the best closed systems. (arxiv.org)

### Why should anyone outside robotics care?

Because embodied AI is where model capability meets the physical world. A chatbot can be wrong and annoying. A robot can be wrong and useless, or dangerous. The hard problem has been getting systems that are both smart enough to reason about cluttered scenes and fast enough to act reliably.
If Ai2’s results hold up, the interesting shift is not just “one model beat another.” It is that open, reproducible robot stacks may be getting good enough to become the default research base layer. (github.com)

### Bottom line

This looks like a real benchmark moment for open robotics. The important part is not just that MolmoAct 2 topped π0.5 on seven tests. It’s that Ai2 is arguing embodied reasoning, not just bigger general-purpose language models, is becoming the key bottleneck, and now the open side has a serious answer. (arxiv.org) (allenai.org)
