VEGA‑3D boosts 3D reasoning
A new paper introduces VEGA‑3D, which uses video diffusion as a 'Latent World Simulator' to inject implicit 3D priors into multimodal LLMs and improve results on scene understanding and manipulation benchmarks without explicit 3D supervision. (x.com) The approach blends spatiotemporal features into the model's semantic tokens through gated fusion, a promising direction for robotics and spatially aware agents. (x.com)
The arXiv submission lists Xianjin Wu, Dingkang Liang, Tianrui Feng, Kui Xia, Yumeng Zhang, Xiaofan Li, Xiao Tan and Xiang Bai as authors, affiliated with Huazhong University of Science and Technology and Baidu Inc. (arxiv.org) The paper and official code were both posted on March 20, 2026, and the GitHub repository for VEGA‑3D shows a March 20, 2026 release note plus recent commits and 48 stars at the time of posting. (github.com)

VEGA‑3D explicitly repurposes a pre‑trained video diffusion model as a “Latent World Simulator” and extracts spatiotemporal features from intermediate noise levels to produce geometric cues. (arxiv.org) The paper introduces a token‑level adaptive gated fusion module designed to align and fuse heterogeneous generative tokens with semantic token representations inside multimodal LLM pipelines. (arxiv.org)

The authors report extensive experiments “across 3D scene understanding, spatial reasoning, and embodied manipulation benchmarks” and state that VEGA‑3D outperforms state‑of‑the‑art baselines; the repository was updated with a performance figure on the day of release. (arxiv.org) (github.com)

The codebase expects standard 3D benchmark data in a data/ directory and references EmbodiedScan in its dataset preparation instructions. EmbodiedScan itself contains over 5,000 scans, about 1 million ego‑centric RGB‑D views, and 160,000 3D bounding boxes, and is likely among the benchmarks VEGA‑3D targets. (github.com) (tai-wang.github.io) The repository structure includes folders named llava and trl alongside training scripts, which suggests it provides adapters and training pipelines for integrating VEGA‑3D features with existing MLLMs and instruction‑tuning toolchains. (github.com)

The paper was highlighted in the AI Native Foundation daily digest on March 20, 2026, which summarized VEGA‑3D under the headline “Generation Models Know Space” and listed the approach’s keywords and research objective. (ainativefoundation.org)
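
To make the idea of token‑level gated fusion concrete, here is a minimal, dependency‑free Python sketch. It is not the paper's module: the function name gated_fuse, the single scalar gate per token, and the linear gate over the concatenated tokens are all illustrative assumptions. It also assumes the generative (diffusion‑derived) tokens have already been aligned one‑to‑one with the semantic tokens and projected to the same width.

```python
import math
from typing import List, Sequence

Vector = List[float]


def sigmoid(x: float) -> float:
    """Logistic function mapping a real-valued logit to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fuse(
    semantic: Sequence[Vector],
    generative: Sequence[Vector],
    gate_weights: Vector,
    gate_bias: float = 0.0,
) -> List[Vector]:
    """Fuse two aligned token sequences with one scalar gate per token.

    Hypothetical illustration only: for each token position, a gate
    g = sigmoid(w . [semantic; generative] + b) decides how much of the
    semantic token to keep versus the generative token.
    """
    if len(semantic) != len(generative):
        raise ValueError("token sequences must have the same length")

    fused: List[Vector] = []
    for sem_tok, gen_tok in zip(semantic, generative):
        if len(sem_tok) != len(gen_tok):
            raise ValueError("tokens must share the same feature width")
        concat = list(sem_tok) + list(gen_tok)
        if len(gate_weights) != len(concat):
            raise ValueError("gate_weights must match the concatenated width")

        # Per-token gate computed from both token representations.
        logit = sum(w * x for w, x in zip(gate_weights, concat)) + gate_bias
        g = sigmoid(logit)

        # Convex combination: g keeps semantics, (1 - g) admits geometry cues.
        fused.append([g * s + (1.0 - g) * v for s, v in zip(sem_tok, gen_tok)])
    return fused


if __name__ == "__main__":
    semantic_tokens = [[1.0, 2.0], [0.5, 0.5]]
    generative_tokens = [[3.0, 4.0], [1.5, -0.5]]
    # Zero weights and bias give g = 0.5, i.e. a plain average per token.
    print(gated_fuse(semantic_tokens, generative_tokens, [0.0, 0.0, 0.0, 0.0]))
```

In a real pipeline the gate parameters would be learned, and the gate would more plausibly be a small network over tensors than a single dot product; the sketch only shows why a per‑token gate lets the model lean on geometric cues for some tokens and on semantic features for others.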