GazeVLA infers human intent from gaze for human‑robot teaming
- Researchers Chengyang Li, Kaiyi Xiong and team released GazeVLA, a Vision‑Language‑Intention‑Action framework that uses human gaze to predict intentions for robotic manipulation.
- In experiments the gaze‑regularized VLA improved manipulation success by 4–12% and reduced required robot training steps, per the authors' published benchmark results.
- The work aims to speed safe human‑robot teaming where speech is unavailable, and code/data are on GitHub. (arxiv.org)
A new paper and code release called GazeVLA teaches robots to infer human intent from gaze to guide manipulation and handovers. (lichy2004.github.io) (arxiv.org) The team models intention as discrete 2D gaze coordinates and pretrains a Vision‑Language‑Intention‑Action (VLIA) model on over 150 million egocentric human frames before fine‑tuning on robot data. (lichy2004.github.io) (github.com) Their arXiv paper reports that adding gaze regularization to VLA models raises manipulation success rates by roughly 4–12% on standard benchmarks and reduces the number of training steps required. (arxiv.org) (wispaper.ai)

GazeVLA aims to let robots anticipate a human’s next action, which should make handovers and collaborative tasks smoother and faster when speech or explicit commands are slow or unavailable. (lichy2004.github.io) (youtube.com)

Technically, the method converts temporally aggregated gaze heatmaps into patch‑level distributions and regularizes a transformer's attention via KL divergence to align model attention with human visual patterns. (arxiv.org) The authors say the gaze prior improves robustness to lighting and sensor noise and that the framework requires no eye‑tracking hardware at deployment, relying instead on existing annotated datasets. (arxiv.org) (github.com)

The release sits alongside other recent gaze‑based works such as Gaze‑LLE (CVPR 2025) and gaze‑to‑action affordance studies, but GazeVLA specifically integrates gaze into VLA pipelines for cross‑embodiment transfer. (openaccess.thecvf.com) (openreview.net) Code, data processing scripts, and demos for GazeVLA are available on GitHub, and the full paper and PDF are posted on arXiv; the authors point to continued real‑robot evaluations and dataset releases next. (github.com) (arxiv.org)
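
As a rough illustration of the gaze regularization described above, the following Python sketch averages per-frame gaze heatmaps over a time window, pools the result into a patch-level distribution, and adds a KL penalty between that distribution and a model's attention over the same patches. It is not the authors' implementation: the function names, the use of simple averaging for temporal aggregation, the KL direction (gaze relative to attention), and the default weight are all assumptions made for illustration.

```python
import math


def aggregate_heatmaps(frames):
    """Average per-frame 2D gaze heatmaps over a window (assumed aggregation rule)."""
    n = len(frames)
    rows, cols = len(frames[0]), len(frames[0][0])
    return [
        [sum(frame[r][c] for frame in frames) / n for c in range(cols)]
        for r in range(rows)
    ]


def heatmap_to_patch_distribution(heatmap, patch_size, eps=1e-8):
    """Sum heatmap mass inside each non-overlapping patch, then normalize to sum to 1."""
    rows, cols = len(heatmap), len(heatmap[0])
    patch_mass = []
    for r0 in range(0, rows, patch_size):
        for c0 in range(0, cols, patch_size):
            mass = sum(
                heatmap[r][c]
                for r in range(r0, min(r0 + patch_size, rows))
                for c in range(c0, min(c0 + patch_size, cols))
            )
            # eps keeps every patch strictly positive so the log in the KL term is defined.
            patch_mass.append(mass + eps)
    total = sum(patch_mass)
    return [m / total for m in patch_mass]


def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) for two discrete distributions over the same patches."""
    if len(p) != len(q):
        raise ValueError("distributions must cover the same number of patches")
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def gaze_regularized_loss(task_loss, gaze_patches, attention, weight=0.1):
    """Task loss plus a weighted KL penalty (weight is a hypothetical hyperparameter)."""
    return task_loss + weight * kl_divergence(gaze_patches, attention)


if __name__ == "__main__":
    # Two toy 4x4 frames with gaze concentrated in the top-left corner.
    frame = [[1.0, 1.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0], [0.0] * 4, [0.0] * 4]
    gaze = heatmap_to_patch_distribution(aggregate_heatmaps([frame, frame]), patch_size=2)
    uniform_attention = [0.25] * 4  # attention spread evenly over the 4 patches
    print(gaze_regularized_loss(1.0, gaze, uniform_attention))
```

In this toy setup the penalty is zero when attention matches the gaze distribution and grows as attention drifts away from where the human looked. In a real training loop the attention vector would come from a transformer layer and the loss would be computed with a differentiable framework.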