UniVidX unifies multimodal video models

- Researchers from HKUST, Stanford and other institutions posted UniVidX on May 1, describing one diffusion framework for multiple pixel-aligned video tasks.
- The paper says UniVidX uses stochastic condition masking, per-modality LoRAs and cross-modal self-attention, and generalizes with fewer than 1,000 training videos.
- Code and checkpoints were released on GitHub on May 4, with the paper accepted to SIGGRAPH 2026.

UniVidX is a bid to collapse several specialized video models into one shared system. The paper, posted to arXiv on May 1, describes a unified multimodal framework for “versatile video generation” that handles multiple pixel-aligned video tasks inside a single diffusion backbone.

The authors say the problem with most current pipelines is fragmentation. In their telling, teams often train separate models for each setting (one for intrinsic decomposition, another for alpha matting, another for related generation tasks), which locks systems into fixed input-output mappings and makes cross-modal coordination harder.

Here’s the core idea: instead of wiring a different model to every task, UniVidX treats those tasks as conditional generation in one shared multimodal space. (arxiv.org) That means the same backbone can be trained to move across different video representations depending on which modalities are given as inputs and which are treated as outputs.

### How does one model cover multiple video tasks?

Stochastic Condition Masking is the mechanism that makes the setup flexible. (arxiv.org) During training, the paper says, modalities are randomly split into “clean conditions” and “noisy targets,” so the model learns omni-directional conditional generation rather than one fixed mapping.

In practice, that matters because the model is not limited to a single route such as RGB-to-normal or composite-to-alpha. (arxiv.org) The training scheme is meant to let one system switch roles depending on what information is available at inference time. That is the unifying move in the paper.

### What are the technical pieces doing?

Decoupled Gated LoRA is the paper’s way of adapting to different target modalities without overwriting the backbone’s native diffusion priors. (arxiv.org) The authors say they attach per-modality LoRAs and activate them only when a modality is being generated.

Cross-Modal Self-Attention is the alignment layer.
The paper says UniVidX shares keys and values across modalities while keeping modality-specific queries, a design intended to preserve information exchange across representations without flattening them into a single undifferentiated stream. (arxiv.org)

Taken together, those three pieces (stochastic masking, per-modality LoRAs and cross-modal attention) are the reason UniVidX is being framed as a unified framework rather than a bundle of loosely connected adapters. (arxiv.org)

### What can it actually produce?

The first implementation, UniVid-Intrinsic, works on RGB video and intrinsic maps including albedo, irradiance and normal maps. The second, UniVid-Alpha, works on blended RGB video and its constituent RGBA layers. (arxiv.org)

Those are not headline consumer-generation tasks, but they are building-block tasks inside many video workflows. Intrinsic decomposition can help separate lighting, geometry and appearance signals, while alpha-style decomposition is useful when a system needs to split foreground and background elements. (arxiv.org) That makes the paper more relevant to production tooling than to prompt-driven video demos. This last point is an inference from the task definitions in the paper and code release.

### Why does this matter for video pipelines?

The paper reports performance competitive with state-of-the-art methods across distinct tasks and says the models generalized to in-the-wild scenarios even when trained on fewer than 1,000 videos.

For teams building video systems, the practical appeal is simpler orchestration. A single backbone that can switch among related decomposition and generation tasks could reduce the number of separately trained and separately served models in a pipeline. (arxiv.org) That would not remove the need for workflow logic, but it could narrow the model sprawl that often comes with multimodal video tooling. That is an inference from the architecture the authors describe.

### Where can people inspect it next?
The arXiv entry says the paper has been accepted to ACM Transactions on Graphics as part of SIGGRAPH 2026. The GitHub repository, released on May 4, includes code, checkpoints and inference configurations for UniVid-Intrinsic and UniVid-Alpha, built on Wan2.1-T2V-14B as the backbone. (arxiv.org)
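
For readers who want a concrete picture of the training trick, the random split into clean conditions and noisy targets, plus the rule that a modality's LoRA is active only while that modality is being generated, can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the authors' code: the function names, the uniform sampling of the split, the requirement of at least one target, and the linear noise blend are all hypothetical, and short lists of floats stand in for video latents.

```python
import random

# Modalities handled by UniVid-Intrinsic, as listed in the article.
MODALITIES = ["rgb", "albedo", "irradiance", "normal"]


def sample_condition_mask(modalities, rng):
    """Randomly split modalities into clean conditions and noisy targets.

    Assumption: at least one modality is always a target, and the split
    is sampled uniformly. The paper's actual sampling rule may differ.
    """
    num_targets = rng.randint(1, len(modalities))  # inclusive on both ends
    chosen = set(rng.sample(modalities, num_targets))
    conditions = [m for m in modalities if m not in chosen]
    targets = [m for m in modalities if m in chosen]
    return conditions, targets


def lora_gates(modalities, targets):
    """Hypothetical gate table: 1.0 switches a modality's LoRA on, 0.0 off."""
    return {m: (1.0 if m in targets else 0.0) for m in modalities}


def build_training_inputs(latents, rng):
    """Noise the target modalities and pass the conditions through clean.

    `latents` maps modality name -> list of floats standing in for a video
    latent. The linear noise blend is a toy stand-in, not the real
    diffusion schedule.
    """
    names = list(latents)
    conditions, targets = sample_condition_mask(names, rng)
    noise_level = rng.random()
    inputs = {}
    for name in names:
        if name in targets:
            inputs[name] = [
                (1.0 - noise_level) * x + noise_level * rng.gauss(0.0, 1.0)
                for x in latents[name]
            ]
        else:
            inputs[name] = list(latents[name])
    return inputs, conditions, targets, lora_gates(names, targets)


if __name__ == "__main__":
    rng = random.Random(0)
    toy_latents = {m: [0.5, -0.25, 1.0] for m in MODALITIES}
    _, conditions, targets, gates = build_training_inputs(toy_latents, rng)
    print("conditions:", conditions)
    print("targets:", targets)
    print("gates:", gates)
```

At inference time the split would presumably be fixed by hand instead of sampled. For example, treating `rgb` as the only condition and the three intrinsic maps as targets corresponds to the decomposition direction, while the reverse assignment corresponds to generating RGB from intrinsic maps.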
