OmniVinci‑9B adds multi‑input media
- NVIDIA’s research team released OmniVinci-9B, an open omni-modal model for joint vision, audio and text understanding, with code and weights posted on GitHub.
- The 9 billion-parameter model beats Qwen2.5-Omni on the DailyOmni, MMAR and Video-MME benchmarks while using 0.2 trillion training tokens, versus 1.2 trillion for Qwen2.5-Omni.
- OpenMOSS also released MOSS-Audio on April 13, expanding open multimodal tooling beyond images into speech, music and sound. (github.com)

Multimodal artificial intelligence is the idea that one model can handle several kinds of input at once, like images, video, audio and text. NVIDIA’s OmniVinci-9B is a new open model built for exactly that mix. (github.com) (huggingface.co)

NVIDIA says OmniVinci-9B jointly understands vision, audio and language, and released the model with code, examples and a technical report through NVlabs and Hugging Face. The repository describes it as an “omni-modal” large language model rather than a separate image or speech system. (github.com) (huggingface.co)

The model has 9 billion parameters, a rough measure of size, and NVIDIA says it outperforms Qwen2.5-Omni on three benchmarks: DailyOmni for cross-modal understanding, MMAR for audio and Video-MME for video. The reported gains are +19.05, +1.7 and +3.9 points, respectively. (github.com)

NVIDIA’s technical write-up says OmniVinci uses three architecture changes to keep sound and visuals lined up over time: OmniAlignNet, Temporal Embedding Grouping and Constrained Rotary Time Embedding. In plain terms, those are methods for matching what is seen with what is heard and when each event happens. (github.com) (arxiv.org)

The same report says the team built a data pipeline that generated 24 million single-modal and omni-modal conversations for training. NVIDIA also says the model used 0.2 trillion training tokens, compared with 1.2 trillion for Qwen2.5-Omni. (github.com) (arxiv.org)

That matters because open multimodal models have often forced developers to stitch together separate systems for video, speech and text. OmniVinci is pitched as one model that can inspect a video with audio attached, answer questions about it and reason across both streams in one pass. (huggingface.co) (github.com)

A second release points in the same direction for audio. OpenMOSS said on April 13 that it released MOSS-Audio, an open-source model for speech, environmental sound, music, captioning, question answering and reasoning. (github.com)

OpenMOSS released four versions: MOSS-Audio-4B-Instruct, MOSS-Audio-4B-Thinking, MOSS-Audio-8B-Instruct and MOSS-Audio-8B-Thinking. On April 20, the team added fine-tuning code and documentation for LoRA and full-parameter training. (github.com)

The two projects solve different parts of the same problem. OmniVinci is aimed at combined image, video, audio and text understanding, while MOSS-Audio focuses on the audio side alone, including speaker cues, emotion, background sounds and music. (github.com 1) (github.com 2)

Together, they show how open-source model makers are moving from single-medium tools toward systems that can parse a whole scene, not just a transcript or a frame. For developers, that means fewer handoffs between separate models and more end-to-end media analysis from one stack. (github.com 1) (github.com 2)