VLS boosts robot adaptation 31% on CALVIN
- University of Washington and Allen Institute for AI researchers posted VLS, a training-free method that steers frozen robot policies at test time.
- The paper reports 31% better results on CALVIN and 13% on LIBERO-PRO, plus Franka robot tests under spatial and semantic shifts.
- VLS targets train-test mismatch without retraining by turning vision-language models into reward generators. (arxiv.org)
Robots often fail not because they lack a skill, but because the room, object, or instruction looks slightly different from training. VLS is built to correct that mismatch at inference time. (arxiv.org)

The method comes from Shuo Liu, Ishneet Sukhvinder Singh, Yiqing Xu, Jiafei Duan, and Ranjay Krishna, with affiliations including the University of Washington, the University of Oxford, the National University of Singapore, and the Allen Institute for AI. They posted the paper on arXiv on February 3, 2026. (arxiv.org) (vision-language-steering.github.io)

The basic problem is distribution shift: a policy trained to place a cup in one spot may miss when the target moves near an edge or into clutter. The authors say those failures usually reflect brittle imitation learning, not missing motor ability. (arxiv.org)

VLS keeps the original controller frozen and changes only how actions are sampled at test time. Instead of retraining weights, it uses a vision-language model to write reward functions that score whether a candidate trajectory matches the current scene and instruction. (arxiv.org) (vision-language-steering.github.io)

The system first turns camera input and language into task-relevant 3D keypoints, using the Segment Anything Model and DINOv2 features as a geometric scaffold. A vision-language model then breaks the task into stages and produces differentiable rewards as PyTorch operations. (vision-language-steering.github.io)

Those rewards steer denoising in three ways: gradient-based refinement, radial basis function diversity, and Feynman-Kac resampling. In plain terms, the policy keeps generating action candidates, and VLS nudges sampling toward the ones that better satisfy the instruction. (vision-language-steering.github.io) (github.com)

The paper reports a 31% improvement on CALVIN, a benchmark for long-horizon manipulation, and a 13% gain on LIBERO-PRO. The project page also says VLS reached a 69% average success rate in-distribution, 19% above the frozen pi-0.5 baseline. (arxiv.org) (vision-language-steering.github.io)

The authors also ran real-world tests on a Franka robot under spatial and semantic shifts, the same kinds of changes that show up when objects move or instructions vary. Their claim is that VLS maintained more stable execution without any policy fine-tuning. (arxiv.org) (vision-language-steering.github.io)

That puts VLS in a fast-growing line of work that tries to make pretrained robot policies more steerable instead of more heavily retrained. The pitch is simple: if the skill is already inside the model, test-time guidance may be cheaper than another round of robot data collection. (arxiv.org)

The closing claim is not that robots learned a new hand motion overnight. It is that the same frozen policy can be pushed toward the right behavior when the world shifts a little. (arxiv.org)
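
For readers who want a feel for what reward-guided steering of a frozen policy looks like, here is a toy sketch. It is not the VLS code. The "policy" is just Gaussian noise around a trained target, the reward is a hand-written distance to a shifted target instead of a VLM-generated PyTorch function, and the names `reward`, `reward_grad`, `steer`, and `mean_reward` are hypothetical. It combines a gradient nudge with plain reward-weighted resampling as a simplified stand-in for the Feynman-Kac step, and it leaves out the radial basis function diversity term entirely.

```python
import math
import random


def reward(point, target):
    """Hypothetical reward: negative squared distance to the target keypoint."""
    return -sum((p - t) ** 2 for p, t in zip(point, target))


def reward_grad(point, target):
    """Analytic gradient of `reward` with respect to `point`."""
    return [-2.0 * (p - t) for p, t in zip(point, target)]


def steer(candidates, target, step_size=0.1, temperature=0.05, rng=None):
    """One steering round: gradient nudge, then reward-weighted resampling."""
    if rng is None:
        rng = random.Random(0)
    # 1) Gradient-based refinement: move each candidate uphill on the reward.
    refined = [
        [p + step_size * g for p, g in zip(c, reward_grad(c, target))]
        for c in candidates
    ]
    # 2) Resampling: draw candidates with probability proportional to
    #    exp(reward / temperature), so better candidates get duplicated.
    scores = [reward(c, target) for c in refined]
    best = max(scores)  # subtract the max for numerical stability
    weights = [math.exp((s - best) / temperature) for s in scores]
    return rng.choices(refined, weights=weights, k=len(refined))


def mean_reward(candidates, target):
    """Average reward over a set of candidates."""
    return sum(reward(c, target) for c in candidates) / len(candidates)


if __name__ == "__main__":
    rng = random.Random(42)
    trained_target = (0.0, 0.0)   # where the frozen "policy" learned to aim
    shifted_target = (0.3, -0.2)  # where the object sits at test time
    # Stand-in for a frozen policy: Gaussian samples around the trained target.
    candidates = [
        [rng.gauss(trained_target[0], 0.2), rng.gauss(trained_target[1], 0.2)]
        for _ in range(64)
    ]
    before = mean_reward(candidates, shifted_target)
    for _ in range(5):
        candidates = steer(candidates, shifted_target, rng=rng)
    after = mean_reward(candidates, shifted_target)
    print(f"mean reward before: {before:.3f}, after: {after:.3f}")
```

The point of the sketch is that nothing about the sampler's parameters changes. Only the candidates it produces are nudged and reweighted, which is the sense in which the article describes the policy as frozen.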