MolmoAct 2 beats proprietary models on benchmarks
- On May 5, Ai2 released MolmoAct 2, an open robot manipulation model that reasons in 3D, ships with code and data, and targets real deployment.
- The headline numbers are unusual: up to 37× faster than MolmoAct, 720 hours of bimanual demonstrations, and benchmark wins over π0.5 (for the full model) and over GPT-5 and Gemini Robotics ER-1.5 (for its reasoning backbone).
- That matters because open robot models usually force a tradeoff between speed, hardware cost, and capability. Ai2 claims MolmoAct 2 avoids that tradeoff.
Robot models have had a pretty obvious problem. The smartest ones are usually closed, the open ones often need expensive hardware or lots of task-specific tuning, and the “reasoning” versions can be so slow that they stop being useful on an actual robot.

That is the gap Ai2 is trying to close with MolmoAct 2, released on May 5. It is an open vision-language-action model for manipulation, the kind of system you would want for picking, placing, opening, sorting, and other real-world arm tasks, and Ai2 says it is both stronger and much faster than the first MolmoAct.

### What kind of model is this?

MolmoAct 2 is a robot policy: it takes in what the robot sees and the instruction it is given, then predicts actions. The twist is that it does not jump straight from pixels to motor commands. It first builds a 3D-ish internal picture of the scene and reasons over that before acting. That is the core MolmoAct idea: action reasoning instead of pure reflex.

### What changed in version 2?

Ai2 rebuilt a lot of the stack. (allenai.org)

The new system uses Molmo 2-ER, a vision-language backbone specialized for embodied and spatial reasoning, trained on about 3.3 million examples. It also adds OpenFAST, an open action tokenizer trained on millions of trajectories across five robot embodiments, plus a redesigned architecture that pairs the reasoning model with a continuous-action expert. (allenai.org)

### Why does the speed jump matter so much?

Because robots live in time. A chatbot can pause and think for a second. A robot arm hovering over a mug, a drawer, or a test tube rack really cannot.

Ai2 says MolmoAct 2 can run up to 37× faster than its predecessor, largely because its “Think” mode only recomputes depth reasoning for the parts of the scene that changed. Basically, instead of redrawing the whole map every frame, it updates the moving pieces. (arxiv.org)

### Did it actually beat closed models?

On Ai2’s reported benchmarks, yes.
The paper says MolmoAct 2 outperformed strong baselines, including π0.5, across seven simulation and real-world benchmarks. Separately, its embodied-reasoning backbone, Molmo 2-ER, beat GPT-5 and Gemini Robotics ER-1.5 across 13 embodied-reasoning benchmarks. Those are not the same test suites, so the clean takeaway is narrower than “best robot model overall,” but the results are still a big deal for an open system. (allenai.org)

### What did Ai2 open up?

More than just weights. Ai2 says it released the model weights, training code, complete training data, and evaluation artifacts. It also published MolmoAct2-BimanualYAM, a 720-hour teleoperated bimanual manipulation dataset that it describes as the largest open-source bimanual tabletop manipulation dataset so far. That matters because robotics papers often open one layer and keep the rest shut. (arxiv.org)

### Why is bimanual data important?

Two-arm manipulation is where a lot of useful work starts to look real: folding, holding-and-inserting, stabilizing one object while moving another. It is also much harder than single-arm pick-and-place because timing and geometry matter more. A large open dataset here gives smaller labs and startups a way to work on that problem without first building a giant teleoperation pipeline from scratch. (allenai.org)

### What is the catch?

The obvious one is that these are fresh, self-reported results from Ai2 and an arXiv paper, not yet broadly replicated by independent groups. And “beats proprietary models” can hide a lot of benchmark nuance: different tasks, robot setups, and evaluation rules can change the story.

But even with that caution, the package is notable: open weights, open data, cheaper hardware targets, and a latency story that sounds like it was designed by people who actually want the robot to move now, not after a long think. (allenai.org)

### Bottom line

The important part is not just that MolmoAct 2 scored well.
It is that Ai2 is arguing open robot models no longer have to choose between being interpretable, affordable, and fast enough to use. If that claim holds up, this is the kind of release that changes who gets to build serious manipulation systems. (allenai.org)
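### What does “only recompute what changed” look like?

Ai2 attributes much of the up-to-37× speedup to a “Think” mode that only recomputes depth reasoning for the parts of the scene that changed. That is a familiar caching pattern, and the Python below is a minimal, hypothetical sketch of the general idea, not MolmoAct 2's implementation (Ai2's description does not go to this level of detail). Everything here is an assumption made for illustration: the frame is split into fixed-size tiles, a tile counts as “changed” when its mean absolute pixel difference exceeds a threshold, and `depth_fn` is a toy stand-in for an expensive depth estimator. `IncrementalDepthCache`, `depth_fn`, and `fake_depth` are invented names.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical types for this sketch: a grayscale frame as rows of floats,
# and a tile addressed by its (row, col) position in the tile grid.
Frame = List[List[float]]
Tile = Tuple[int, int]


class IncrementalDepthCache:
    """Hypothetical sketch, NOT MolmoAct 2's code: recompute per-tile depth
    only for tiles whose pixels changed by more than a threshold."""

    def __init__(self, tile_size: int, threshold: float,
                 depth_fn: Callable[[Frame], float]) -> None:
        self.tile_size = tile_size
        self.threshold = threshold    # mean abs pixel change that counts as "changed"
        self.depth_fn = depth_fn      # stand-in for an expensive depth estimator
        self.ref_tiles: Dict[Tile, Frame] = {}  # pixels each cached depth was computed from
        self.depth: Dict[Tile, float] = {}      # cached depth estimate per tile
        self.recomputed = 0           # depth_fn calls made by the last update()

    def _split(self, frame: Frame) -> Dict[Tile, Frame]:
        # Cut a non-empty frame into tile_size x tile_size blocks (edge tiles may be smaller).
        s = self.tile_size
        tiles: Dict[Tile, Frame] = {}
        for y in range(0, len(frame), s):
            for x in range(0, len(frame[0]), s):
                tiles[(y // s, x // s)] = [row[x:x + s] for row in frame[y:y + s]]
        return tiles

    @staticmethod
    def _mean_abs_diff(a: Frame, b: Frame) -> float:
        # Assumes a and b have the same shape.
        total = sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))
        return total / sum(len(row) for row in a)

    def update(self, frame: Frame) -> Dict[Tile, float]:
        self.recomputed = 0
        for key, tile in self._split(frame).items():
            ref = self.ref_tiles.get(key)
            # Compare against the tile the cached depth came from, not the
            # previous frame, so small changes cannot pile up unnoticed.
            if ref is None or self._mean_abs_diff(ref, tile) > self.threshold:
                self.depth[key] = self.depth_fn(tile)
                self.ref_tiles[key] = tile
                self.recomputed += 1
        return self.depth


def fake_depth(tile: Frame) -> float:
    # Toy stand-in for a depth model: just the tile's mean intensity.
    return sum(map(sum, tile)) / sum(len(row) for row in tile)


cache = IncrementalDepthCache(tile_size=2, threshold=0.1, depth_fn=fake_depth)
frame0 = [[0.0] * 4 for _ in range(4)]   # 4x4 frame -> 2x2 grid of tiles
cache.update(frame0)
print(cache.recomputed)                  # 4: every tile is new on the first frame

frame1 = [row[:] for row in frame0]
frame1[0][0] = 1.0                       # only the top-left tile changes
cache.update(frame1)
print(cache.recomputed)                  # 1: the other three reuse cached depth
print(cache.depth[(0, 0)])               # 0.25
```

Comparing each tile against the version its cached depth was computed from, rather than against the previous frame, keeps slow drift from slipping under the threshold one frame at a time. The tradeoff is that the threshold now directly controls how stale a cached estimate is allowed to get.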