Netflix unveils VOID model
Netflix and Bulgaria’s INSAIT released VOID, an open-source model that removes objects from video while reconstructing the scene so motion stays realistic rather than jittery or inconsistent. The technical leap is preserving temporal consistency across frames, which is far harder than editing single images and matters when video editing is deployed in real post-production pipelines. Because Netflix is shipping this as open source, it doubles as a hands-on case study in video reconstruction and temporal computer vision that goes beyond classification demos. (therecursive.com) (digitaltrends.com)
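To make "temporal consistency" concrete, here is a minimal sketch of one rough proxy: the average pixel change between consecutive frames of a region that should be static. This is an illustration only, not the metric used in the VOID paper or repository; the names `frame_difference` and `flicker_score` and the tiny grayscale frames are assumptions made up for the example.

```python
from statistics import mean

# Assumption: a frame is a grayscale image stored as rows of intensities in [0, 1].
Frame = list[list[float]]


def frame_difference(a: Frame, b: Frame) -> float:
    """Mean absolute per-pixel difference between two same-sized frames."""
    return mean(
        abs(pa - pb)
        for row_a, row_b in zip(a, b)
        for pa, pb in zip(row_a, row_b)
    )


def flicker_score(frames: list[Frame]) -> float:
    """Average change between consecutive frames; higher means more flicker."""
    if len(frames) < 2:
        return 0.0
    return mean(frame_difference(prev, cur) for prev, cur in zip(frames, frames[1:]))


# A static 2x2 background, filled in consistently...
stable = [[[0.5, 0.5], [0.5, 0.5]] for _ in range(4)]
# ...versus the same background where one filled-in pixel changes every frame.
flickering = [[[0.5, 0.5], [0.5, 0.5 if t % 2 == 0 else 0.9]] for t in range(4)]

print("stable:", flicker_score(stable))
print("flickering:", flicker_score(flickering))
```

Every frame of the flickering clip could pass a still-image inspection on its own; the problem only shows up when frames are compared in sequence, which is why per-frame editing is not enough for video.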
Video editing breaks in a very specific way: a single frame can look perfect, but 24 or 30 frames in a row will betray the trick with flicker, wobble, or an object that keeps “remembering” something you erased one moment earlier. That is why video object removal is harder than photo retouching. (arxiv.org)

Most older systems treat the job like painting over a stain one frame at a time. They can fill in a wall behind a person, but they usually fail when the deleted person was pushing, holding, blocking, or colliding with something else in the scene. (arxiv.org)

Netflix and Bulgaria’s Institute for Computer Science, Artificial Intelligence and Technology, known as INSAIT, built a model for that second problem. They call it VOID, short for Video Object and Interaction Deletion, and the point is to remove not just the object but also the chain of effects it caused. (github.com)

The paper’s examples are simple on purpose. If you remove the middle dominoes, the last yellow block should never fall, and if you remove the hands that started spinning tops, the tops should keep spinning instead of snapping into a weird new motion. (arxiv.org)

To teach that behavior, the team built paired training data with Kubric and HUMOTO. Those datasets let the model compare the original clip with a “counterfactual” version of the same scene, meaning a version where one object was never there and the later physics had to change too. (arxiv.org)

Under the hood, VOID is built on Alibaba’s CogVideoX and then fine-tuned for video inpainting, which is the task of filling in missing parts of a moving image. Netflix says the model uses interaction-aware mask conditioning, which means the edit tells the system not only what to erase but also which nearby regions were affected by that erased object. (github.com)

Netflix released two transformer checkpoints instead of one. Pass 1 is the base inpainting model, and Pass 2 is a warped-noise refinement step that the repository says improves temporal consistency across frames. (github.com)

The open-source release is unusually practical for a research project. The GitHub repository includes code, a notebook, sample assets, and Apache 2.0 licensing, although the quick-start note says running it comfortably needs a graphics processor with more than 40 gigabytes of video memory, such as an NVIDIA A100. (github.com)

The masking workflow also shows how close this is to production tooling rather than a toy demo. Stage 1 uses Gemini through the Google AI API, and the setup also requires Meta’s Segment Anything Model 2 (SAM 2) for mask generation before the actual video rewrite starts. (github.com)

Netflix is not claiming that every erased object can now be rewritten perfectly. The paper says existing methods still do well on shadows and reflections, and VOID is aimed at the harder cases where deleting one thing should alter what happens next in the scene. (arxiv.org)

That makes this release less like a magic eraser and more like a physics-aware editor. It is a rare case where a major studio is publishing a hands-on model for a post-production problem that usually stays inside proprietary visual-effects pipelines. (github.com)
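The idea of interaction-aware mask conditioning described above can be sketched as a label map that marks, per pixel, whether to keep it, erase it, or treat it as affected by the erased object. The sketch below is a guess at the concept, not the repository's actual mask format: the label values `KEEP`, `AFFECTED`, and `REMOVE` and the function `combine_masks` are hypothetical names chosen for illustration.

```python
# Hypothetical label values; the real VOID mask encoding may differ.
KEEP, AFFECTED, REMOVE = 0, 1, 2

# Assumption: a binary mask is rows of booleans for a single frame.
Mask = list[list[bool]]


def combine_masks(object_mask: Mask, affected_mask: Mask) -> list[list[int]]:
    """Merge an object mask and an affected-region mask into one label map.

    REMOVE takes priority where the two masks overlap, so the erased
    object itself is never marked as merely "affected".
    """
    labels = []
    for obj_row, aff_row in zip(object_mask, affected_mask):
        row = []
        for is_object, is_affected in zip(obj_row, aff_row):
            if is_object:
                row.append(REMOVE)
            elif is_affected:
                row.append(AFFECTED)
            else:
                row.append(KEEP)
        labels.append(row)
    return labels


# One row of four pixels: a hand covers pixel 1, and the spinning top it
# touched spans pixels 1 and 2.
hand = [[False, True, False, False]]
top_region = [[False, True, True, False]]

print(combine_masks(hand, top_region))
```

In this toy frame the hand pixel is labeled for removal, the remaining top pixel is flagged as affected, and everything else is kept. A plain inpainting mask would carry only the first of those three signals, which is the gap the article's "what to erase" versus "what was affected" distinction points at.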