VEGA‑3D paper

New work, 'Generation Models Know Space' (VEGA‑3D), shows that video diffusion models can act as Latent World Simulators, providing implicit 3D priors that boost MLLM spatial reasoning without any 3D supervision. (x.com)

The manuscript "Generation Models Know Space" appears on arXiv as arXiv:2603.19235 and was submitted on March 19, 2026. The listed authors are Xianjin Wu, Dingkang Liang, Tianrui Feng, Kui Xia, Yumeng Zhang, Xiaofan Li, Xiao Tan and Xiang Bai, affiliated with Huazhong University of Science and Technology and Baidu Inc. (arxiv.org) The paper describes VEGA-3D as a plug-and-play pipeline that extracts spatiotemporal features from intermediate-noise latents of a pre-trained video diffusion model and injects them into multimodal LLMs using a token-level adaptive gated fusion mechanism. (arxiv.org)

The authors published code and assets in a GitHub repository titled H-EmbodVis/VEGA-3D, with a project announcement dated March 20, 2026 and README entries showing released training and evaluation scripts. (github.com) The repository includes preprocessing and evaluation scripts named for 3D benchmarks such as ScanRefer, ScanQA, SQA3D, Scan2Cap and Multi3DRefer, plus an eval_scanrefer.sh that exposes a generative_model_id parameter used in reproduction runs. (github.com)

The paper's abstract and experimental tables report that VEGA-3D outperforms state-of-the-art baselines on multiple 3D scene-understanding, spatial-reasoning and embodied-manipulation benchmarks, and its tables include comparative results on VSI-Bench. (arxiv.org) The release was picked up by paper-aggregation services and daily digests on March 20–21, 2026, with entries on AI Native's daily paper digest and listings on sites aggregating new arXiv submissions. (ainativefoundation.org)
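To make the "token-level adaptive gated fusion" idea more concrete, here is a minimal sketch of what per-token gating between two aligned feature streams can look like. It is an illustration only and not the paper's implementation: the function name gated_fuse, the argument names, the sigmoid gate computed from the concatenated token pair, and the convex blend are all assumptions made for the example, and the actual VEGA-3D design may differ.

```python
import math
from typing import List, Sequence

Vector = List[float]


def sigmoid(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fuse(
    semantic_tokens: Sequence[Vector],
    generative_tokens: Sequence[Vector],
    gate_weights: Vector,
    gate_bias: float = 0.0,
) -> List[Vector]:
    """Blend two aligned token sequences with one scalar gate per token.

    Hypothetical sketch: the gate for each token is a sigmoid over a linear
    projection of the concatenated [semantic; generative] features, and the
    output is a convex combination of the two streams.
    """
    if len(semantic_tokens) != len(generative_tokens):
        raise ValueError("token sequences must have the same length")

    fused: List[Vector] = []
    for sem, gen in zip(semantic_tokens, generative_tokens):
        if len(sem) != len(gen):
            raise ValueError("paired tokens must have the same dimension")
        concat = list(sem) + list(gen)
        if len(concat) != len(gate_weights):
            raise ValueError("gate_weights must match concatenated token size")
        # One gate value per token (assumed form, not taken from the paper).
        logit = sum(w * x for w, x in zip(gate_weights, concat)) + gate_bias
        gate = sigmoid(logit)
        # gate near 0 keeps the semantic token, gate near 1 keeps the generative one.
        fused.append([(1.0 - gate) * s + gate * g for s, g in zip(sem, gen)])
    return fused


# Toy usage: zero weights and zero bias give a gate of 0.5 for every token,
# so each fused token is the plain average of the two inputs.
semantic = [[1.0, 0.0]]
generative = [[0.0, 1.0]]
print(gated_fuse(semantic, generative, gate_weights=[0.0, 0.0, 0.0, 0.0]))
# prints [[0.5, 0.5]]
```

In a trained system the gate parameters would be learned, so each token can lean on the diffusion-derived features to a different degree; the fixed weights here only demonstrate the mechanics.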
