Hierarchical latent planners hit ~70% success
- Meta and NYU researchers posted a paper in April 2026 showing that hierarchical planning on latent world models can drive real Franka pick-and-place tasks zero-shot.
- The headline result is 70% pick-and-place success from a single final goal image, versus 0% for the flat V-JEPA2-AC planner baseline.
- It matters because long-horizon robot planning usually breaks on search cost and prediction drift; hierarchy cuts both while improving robustness.

Robotic planning is getting a very specific upgrade. Not a better gripper, not a bigger dataset, but a better way to think several moves ahead. A team from Meta, NYU, Mila, and Brown dropped a paper in April 2026 showing that hierarchical planning on top of latent world models can make a real Franka robot solve zero-shot pick-and-place tasks that flat planners mostly botch. The eye-catching number is 70% success from just a final goal image, where the single-level baseline got 0%. (arxiv.org)

### What is the actual news here?

The paper is called *Hierarchical Planning with Latent World Models*. It was submitted to arXiv on April 3, 2026 by Wancong Zhang, Basile Terver, Artem Zholus, Soham Chitnis, Harsh Sutaria, Mido Assran, Randall Balestriero, Amir Bar, Adrien Bardes, Yann LeCun, and Nicolas Ballas. The core claim is simple: take an existing world model and add a planning hierarchy at inference time, without needing task-specific skills or rewards. (arxiv.org)

### What is a latent world model?

Basically, it is a model that does not try to predict every future pixel. It compresses the scene into a learned internal representation (a latent state) and predicts how that hidden state changes when the robot acts. That matters because pixel prediction is expensive and brittle, while latent prediction can keep the task-relevant structure and ignore visual noise. The same paper evaluates the approach across latent world-model families, including V-JEPA2-AC, DINO-WM, and PLDM. (arxiv.org)

### Why do flat planners fail on long tasks?

Because the robot has to search too far ahead, and every imagined step adds error. A flat planner can do okay on “greedy” tasks where every move obviously gets closer to the goal: reach, grasp, push. But pick-and-place in clutter often is not like that. Sometimes the arm has to move away from the final target first, or line up a grasp that only pays off later. The project page notes that the flat planner only succeeds when humans manually break the task into easier subgoals.
(arxiv.org)

### So what does the hierarchy add?

Two planning levels in the same latent space. The high-level planner picks a macro direction using a long-horizon model. Then the low-level planner turns the first predicted latent waypoint into actual primitive actions with a short-horizon model. The clever bit is that both levels share the same representation, so the “subgoal” does not need to be hand-designed or translated between representations; it already lives in the model’s internal world. (arxiv.org) (kevinghst.github.io)

### What did it do on real robots?

On a Franka dual-gripper setup, the method hit 70% success on zero-shot pick-and-place from only a final goal specification. The flat V-JEPA2-AC planner scored 0% in the same non-greedy setup. The paper and project page also mention drawer-manipulation tasks on the same platform; the broader point is that end-to-end execution becomes possible without manually feeding intermediate goals. (arxiv.org)

### Is this only a robotics demo?

No, and that is the interesting part. The authors pitch the method as a planning abstraction, not a one-off robot stack. In simulation, across push manipulation and maze navigation, the hierarchical version gets higher success while using up to 4x less planning-time compute, with the project page describing roughly 3x lower planning cost in highlighted comparisons. So the gain is not just “works better,” but “searches smarter.” (arxiv.org)

### Why are people paying attention?

Because this sits right on the fault line in embodied AI. Big pretrained video models like V-JEPA 2 can already learn useful physical representations from huge amounts of passive data, but turning those representations into reliable long-horizon action is the hard part. This paper suggests you may not need a whole new robot policy stack; you may just need a better planner layered on top, which is a cheaper and more modular story than end-to-end retraining every time. (arxiv.org)

### What is the catch?

70% is not solved robotics.
It also comes from a specific real-robot setup, and the public code release is a minimal implementation focused on one benchmark branch rather than the full cross-model evaluation. So this looks more like a strong proof of concept than a finished general-purpose robot brain. But it turns out that is enough to matter: if hierarchy consistently rescues long-horizon planning, some of the next gains may come from search structure, not just bigger models. (arxiv.org) (github.com)
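
To make the two-level idea from the hierarchy section concrete, here is a toy sketch. It is not the paper's code. Everything in it is an assumption made for illustration: the "latent state" is a 2-D grid point, `step`, `distance`, `plan`, and `hierarchical_control` are hypothetical names, one shared `step` function stands in for both the long-horizon and the short-horizon model, and exhaustive search stands in for whatever optimizer a real planner would use. The goal is also chosen to sit on the macro-step grid so the toy stays short.

```python
from itertools import product

# Hypothetical toy setup: a "latent state" is just a 2-D grid point.
PRIMITIVES = [(1, 0), (-1, 0), (0, 1), (0, -1)]  # low-level actions
MACRO_SCALE = 4                                   # one macro step spans 4 primitive steps
MACROS = [(dx * MACRO_SCALE, dy * MACRO_SCALE) for dx, dy in PRIMITIVES]


def step(state, action):
    """Toy world model: predict the next latent state (used at both levels)."""
    return (state[0] + action[0], state[1] + action[1])


def distance(a, b):
    """Planning cost between two latent states (Manhattan distance)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def plan(state, target, actions, horizon):
    """Try every action sequence of the given length and return the predicted
    rollout with the lowest summed distance to the target."""
    best_cost, best_rollout = None, None
    for seq in product(actions, repeat=horizon):
        rollout, s = [], state
        for a in seq:
            s = step(s, a)
            rollout.append(s)
        cost = sum(distance(x, target) for x in rollout)
        if best_cost is None or cost < best_cost:
            best_cost, best_rollout = cost, rollout
    return best_rollout


def hierarchical_control(start, goal, max_replans=20):
    """Replan at two levels until the goal latent is reached."""
    state = start
    trajectory = [state]
    for _ in range(max_replans):
        if state == goal:
            break
        # High level: search over macro steps, keep only the first waypoint.
        waypoint = plan(state, goal, MACROS, horizon=3)[0]
        # Low level: search over primitive steps toward that waypoint, then execute.
        for s in plan(state, waypoint, PRIMITIVES, horizon=MACRO_SCALE):
            state = s
            trajectory.append(state)
    return trajectory


trajectory = hierarchical_control((0, 0), (8, 4))
print(trajectory[-1], len(trajectory) - 1)  # prints: (8, 4) 12
```

In this toy, each replan enumerates 4^3 + 4^4 = 320 candidate sequences, while a flat exhaustive search over the same 12 primitive steps would enumerate 4^12 = 16,777,216. That gap is a property of this sketch's brute-force search, not a figure from the paper, but it shows why splitting the horizon can shrink the search.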