VEGA‑3D: Latent World Simulator
A new paper, "Generation Models Know Space," introduces VEGA‑3D, which repurposes video diffusion models as a Latent World Simulator that builds implicit 3D priors for scene understanding without explicit 3D supervision. The approach extracts spatiotemporal features that improve spatial reasoning and embodied manipulation, and it reportedly outperforms baselines on standard benchmarks. (x.com)
The arXiv preprint "Generation Models Know Space" (arXiv:2603.19235) was submitted on March 19, 2026. It lists Xianjin Wu, Dingkang Liang, Tianrui Feng, Kui Xia, Yumeng Zhang, Xiaofan Li, Xiao Tan and Xiang Bai as authors, affiliated with Huazhong University of Science and Technology and Baidu Inc. (arxiv.org) The arXiv record gives the paper's length as 31 pages with 12 figures and notes that a DataCite DOI is pending for the submission. (arxiv.org)
The authors published the official code and training/evaluation scripts to the H-EmbodVis/VEGA-3D GitHub repository on March 20, 2026. The repository is released under an Apache-2.0 license and had roughly 48 stars shortly after release. (github.com)
The repository README expects a data/ root containing a benchmark/ directory and an embodiedscan/ folder, which suggests that the project integrates or evaluates against the EmbodiedScan suite. (github.com) EmbodiedScan, the first-person 3D perception benchmark referenced by the repo, contains over 5,000 real scans, about 1 million ego-centric RGB‑D views, roughly 1 million language prompts and 160,000 oriented 3D boxes across hundreds of categories. (arxiv.org)
The published code includes an installation guide that calls for creating a conda environment with Python 3.10 and installing the flash-attn package, among other dependencies, for training and evaluation. (github.com)
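The setup requirements described above (a data/ root with benchmark/ and embodiedscan/ folders, Python 3.10, and the flash-attn package) can be checked before launching anything. The following is a minimal sketch and is not taken from the repository: the function names are hypothetical, the check assumes the two folders sit directly under data/, and it assumes flash-attn is importable under the module name flash_attn.

```python
import importlib.util
import sys
from pathlib import Path

# Assumption based on the README description: data/ holds these two folders.
REQUIRED_DATA_DIRS = ("benchmark", "embodiedscan")


def missing_data_dirs(data_root):
    """Return the required sub-directories that are absent under data_root."""
    root = Path(data_root)
    return [name for name in REQUIRED_DATA_DIRS if not (root / name).is_dir()]


def environment_problems(version_info=None, module_name="flash_attn"):
    """Return readable problems with the interpreter version or a dependency.

    version_info defaults to the running interpreter; module_name is the
    import name to look for (hypothetical helper, not part of VEGA-3D).
    """
    major, minor = (version_info or sys.version_info)[:2]
    problems = []
    if (major, minor) != (3, 10):
        problems.append(f"expected Python 3.10, found {major}.{minor}")
    # find_spec returns None when a top-level module cannot be located.
    if importlib.util.find_spec(module_name) is None:
        problems.append(f"module {module_name!r} is not importable")
    return problems


if __name__ == "__main__":
    for name in missing_data_dirs("data"):
        print(f"missing directory: data/{name}")
    for problem in environment_problems():
        print(problem)
```

Run from the repository root, the script prints one line per missing folder or unmet requirement and prints nothing when all checks pass. It only inspects the filesystem and the interpreter, so it does not confirm that the datasets themselves are complete.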