3WM model enables editable 3D views

- Stanford and OpenAI researchers presented 3WM, a single model for depth estimation, novel-view synthesis, and object manipulation from one image and prompts.
- The ICLR 2026 poster says tasks emerge from different inference paths through one graphical model, with zero-shot control and no task-specific finetuning.
- It extends 2025 LRAS work into a unified “physical world model” for 3D interaction. (openreview.net)

A 3D model tries to infer the hidden shape of a scene from flat images, so it can predict what the view looks like after a camera move. Stanford and OpenAI researchers say their new system, 3WM, does that and also edits objects inside the scene. (openreview.net)

The paper is titled “Unified 3D Scene Understanding Through Physical World Modeling,” by Wanhee Lee, Klemen Kotar, Rahul Mysore Venkatesh, Jared Watrous, Honglin Chen, Khai Loong Aw, and Daniel L. K. Yamins. OpenReview lists it as an ICLR 2026 poster, published January 26, 2026 and last modified April 10, 2026. (openreview.net)

Most systems in this area split the job into separate tools: one predicts depth, another generates new viewpoints, and another handles object edits. The 3WM paper says its model treats those as different prompts through one probabilistic graphical model instead. (openreview.net)

In plain terms, the model takes scene elements such as red-green-blue images, optical flow, and camera pose as connected variables. It then chooses different inference paths to answer different 3D questions about the same scene. (openreview.net)

The paper says novel-view synthesis comes from red-green-blue and dense-flow prompts, object manipulation from red-green-blue and sparse-flow prompts, and depth estimation from red-green-blue plus camera conditioning. Those tasks are reported as zero-shot, without task-specific training. (openreview.net)

That matters because recent image-editing pipelines often rely on fine-tuned diffusion models or bolt-on depth predictors. The authors argue those systems can drift on object identity, lighting, and camera control in real-world scenes. (openreview.net) (arxiv.org)

The 3WM paper claims state-of-the-art results on novel-view synthesis and 3D object manipulation, while also producing depth estimates from the same framework. It also says the model can compose actions, such as moving an object aside while navigating through a 3D environment. (openreview.net)

The project builds on the team’s April 4, 2025 paper, “3D Scene Understanding Through Local Random Access Sequence Modeling,” or LRAS. That earlier work used optical flow as an intermediate representation and reported state-of-the-art results on novel-view synthesis and object manipulation, with depth estimation added through a sequence-design change. (arxiv.org)

A GitHub repository for 3WM says the codebase is still under construction and the full release is not yet public. The README labels the project “ICLR 2026” and says researchers can request interim access by email. (github.com)

The paper’s reviewers were broadly supportive but not unanimous. OpenReview’s meta-review says the submission received scores of 8, 6, 6, and 4, with concerns about clarity, evaluation completeness, and how broadly the demonstrated capabilities extend. (openreview.net)

For now, the news is less a consumer product launch than a research claim: one model, one scene representation, three jobs that are usually separate. The test for 3WM will be whether its public code and later follow-up papers match the control and consistency the paper reports. (openreview.net) (github.com)
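For readers who think in code, the prompt-to-task pairing the paper describes can be pictured as a simple lookup. The Python sketch below is only an illustration of that pairing as summarized above: the names `Prompt`, `TASK_BY_PROMPTS`, and `task_for` are hypothetical, and nothing here reflects the actual 3WM codebase, which is not yet public.

```python
from enum import Enum


class Prompt(Enum):
    """Kinds of conditioning named in the paper's summary (labels are hypothetical)."""
    RGB = "rgb"
    DENSE_FLOW = "dense_flow"
    SPARSE_FLOW = "sparse_flow"
    CAMERA = "camera"


# Pairings as reported in the article; this table is an illustration,
# not the model's real interface.
TASK_BY_PROMPTS = {
    frozenset({Prompt.RGB, Prompt.DENSE_FLOW}): "novel_view_synthesis",
    frozenset({Prompt.RGB, Prompt.SPARSE_FLOW}): "object_manipulation",
    frozenset({Prompt.RGB, Prompt.CAMERA}): "depth_estimation",
}


def task_for(prompts):
    """Return the task associated with a set of prompt kinds, or None if unlisted."""
    return TASK_BY_PROMPTS.get(frozenset(prompts))


if __name__ == "__main__":
    print(task_for([Prompt.RGB, Prompt.SPARSE_FLOW]))
```

The point of the sketch is that the task is selected by which prompts are supplied, not by swapping in a different model.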
