Learning human-intention priors paper
- Yifan Xie, YuAn Wang, Guangyu Chen, Jinkun Liu, Yu Sun, and Wenbo Ding posted a new arXiv robotics paper on April 27.
- The paper builds HA-2.2M, a 2.2 million-episode action-language dataset, and uses it to train MoT-HRA, a hierarchical robot-learning system.
- The bigger idea is using human video as a planning prior, not a direct control signal, to bridge the human-robot embodiment gap.

Robotic manipulation is getting good at copying motions, but copying is not the same thing as understanding intent. That gap matters every time a robot has to infer what a person is trying to do, not just where a hand moved. A new arXiv paper, posted April 27, tries to attack exactly that problem. The team introduces a system called MoT-HRA that learns “human-intention priors” from large-scale human demonstrations, then uses those priors to guide robot manipulation. (arxiv.org)

### What is the actual news here?

The paper is called *Learning Human-Intention Priors from Large-Scale Human Demonstrations for Robotic Manipulation*. It comes from Yifan Xie, YuAn Wang, Guangyu Chen, Jinkun Liu, Yu Sun, and Wenbo Ding, and it landed on arXiv on April 27, 2026. The core claim is simple: robots can learn more useful manipulation structure from h[…]level intention from low-level body-specific motion. (arxiv.org)

### Why is “intention” the hard part?

A human video mixes several things together at once: the scene, the object layout, the person’s hand trajectory, and all the quirks of the human body. A robot cannot just replay that bundle. Human fingers, joints, and reach differ from a robot gripper’s hardware, so direct imitation often breaks. The paper frames intention […] abstract than raw motion, but more useful than text alone. (arxiv.org)

### What did they build?

The authors built two main pieces. First is HA-2.2M, a 2.2 million-episode action-language dataset reconstructed from heterogeneous human videos. Second is MoT-HRA, a hierarchical vision-language-action framework that uses that dataset to learn intention priors and transfer them into robotic manipulation. The dataset pipeline includes ha[…] temporal segmentation, and language alignment; in other words, turning messy internet-scale human video into something a robot learner can use. (arxiv.org)

### How is the model organized?

The system is hierarchical on purpose.
One layer handles embodiment-agnostic spatial planning. Another models latent human intention. A final layer maps that into embodiment-specific robot control. That decomposition is the whole trick: instead of asking one model to jump straight from pixels of humans to robot torques, the frame[…]r more cleanly across bodies. (arxiv.org)

### What does it actually improve?

The paper says MoT-HRA improves hand motion plausibility, simulated manipulation, and real-world robot tasks. The especially important phrase is “under distribution shift.” In plain English, the system is supposed to hold up better when the robot sees conditions that do not match its narrow training setup. That is where a lo[…]ne lab scene, then get confused by small changes in object position, viewpoint, or task context. (arxiv.org)

### Why not just train on robot data?

Because robot data is expensive. Human video is abundant. But raw human video is also noisy and body-specific. So the paper is part of a broader robotics push: use giant human datasets for the abstract parts of manipulation, then let robot-specific learning handle the last mile. The catch is that this only works if the inter[…]evant intent rather than surface motion style. That is what MoT-HRA is claiming to do. (arxiv.org)

### Is this about handovers specifically?

Not really, at least not from the paper itself. The work is framed more broadly around robotic manipulation, with experiments spanning hand motion generation, simulation, and real-world robot tasks. So if you saw this described mainly as a handover paper, that is too narrow. The bigger story is a general recipe for extr[…] human demonstrations at scale. (arxiv.org)

### What is the bottom line?

This paper matters because it treats human video less like a script to imitate and more like a source of structured priors about what actions are trying to achieve.
If that framing holds up, it could make robots less brittle in shared environments — not because they copy us better, but because they infer the task behind the motion more reliably. (arxiv.org)
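
To make the three-stage decomposition described under “How is the model organized?” more concrete, here is a minimal Python sketch. It is a toy illustration of the general idea, not the paper’s implementation: MoT-HRA is a learned vision-language-action model, while everything below is hand-written geometry. The names `plan_spatial`, `infer_intention`, `to_robot_commands`, `RobotSpec`, and the `max_step` parameter are all hypothetical and chosen only for this example.

```python
import math
from dataclasses import dataclass

Vec3 = tuple[float, float, float]


@dataclass(frozen=True)
class SpatialPlan:
    """Stage 1 output: waypoints in world space, independent of any body."""
    waypoints: list[Vec3]


@dataclass(frozen=True)
class Intention:
    """Stage 2 output: a toy stand-in for a latent intention (where to, how far)."""
    direction: Vec3
    distance: float


@dataclass(frozen=True)
class RobotSpec:
    """Hypothetical embodiment limit: largest displacement per command, in metres."""
    max_step: float


def plan_spatial(start: Vec3, goal: Vec3, steps: int) -> SpatialPlan:
    """Embodiment-agnostic planning, here just straight-line interpolation."""
    points = [
        tuple(s + (g - s) * i / steps for s, g in zip(start, goal))
        for i in range(steps + 1)
    ]
    return SpatialPlan(points)


def infer_intention(plan: SpatialPlan) -> Intention:
    """Summarise the plan by its net displacement, ignoring how it was executed."""
    first, last = plan.waypoints[0], plan.waypoints[-1]
    delta = tuple(b - a for a, b in zip(first, last))
    distance = math.sqrt(sum(d * d for d in delta))
    if distance == 0.0:
        return Intention((0.0, 0.0, 0.0), 0.0)
    return Intention(tuple(d / distance for d in delta), distance)


def to_robot_commands(intention: Intention, robot: RobotSpec) -> list[Vec3]:
    """Embodiment-specific control: equal steps, none longer than robot.max_step."""
    if intention.distance == 0.0:
        return []
    count = math.ceil(intention.distance / robot.max_step)
    step = intention.distance / count
    return [tuple(c * step for c in intention.direction) for _ in range(count)]


def run_pipeline(start: Vec3, goal: Vec3, robot: RobotSpec, steps: int = 4) -> list[Vec3]:
    """Chain the three stages; only the last one knows anything about the robot."""
    plan = plan_spatial(start, goal, steps)
    intention = infer_intention(plan)
    return to_robot_commands(intention, robot)


if __name__ == "__main__":
    small_arm = RobotSpec(max_step=0.2)
    large_arm = RobotSpec(max_step=1.0)
    for robot in (small_arm, large_arm):
        commands = run_pipeline((0.0, 0.0, 0.0), (0.3, 0.0, 0.4), robot)
        print(robot, len(commands), commands)
```

The point of the sketch is the interface boundary: the same `Intention` is handed to two different `RobotSpec` values, and only the final stage changes its output. In a learned system each stage would be a trained module rather than a formula, but the separation of concerns is the same.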