NVIDIA releases Nemotron 3 Nano Omni
- NVIDIA on April 28 released Nemotron 3 Nano Omni, an open multimodal reasoning model that combines video, audio, images, and text in one system.
- The model uses a 30B-A3B mixture-of-experts design, supports 256K tokens, and NVIDIA says it delivers up to 9x higher throughput than peers.
- The launch extends NVIDIA’s open Nemotron 3 lineup from December into audio-enabled agent workflows on local or enterprise deployments. (nvidia.com)
Artificial intelligence agents often split work across separate vision, speech, and language models. On April 28, NVIDIA said Nemotron 3 Nano Omni combines those jobs in one open model. (nvidia.com 1) (nvidia.com 2)

NVIDIA said the model can take video, audio, images, and text as input and return text, targeting document analysis, audio-video reasoning, and computer-use agents. The company released it through Hugging Face, OpenRouter, build.nvidia.com, and more than 25 partner platforms. (nvidia.com) (huggingface.co)

Under the hood, Nemotron 3 Nano Omni uses a 30B-A3B hybrid mixture-of-experts design: the full model has about 30 billion parameters, but only about 3 billion of them are active for any given token. NVIDIA says that setup helps raise throughput and lower inference cost without dropping multimodal accuracy. (nvidia.com 1) (nvidia.com 2)

The model is also NVIDIA’s first in this multimodal line with native audio input, instead of treating speech as a separate tool bolted onto a vision-language stack. Its research paper says the system improves on Nemotron Nano V2 VL across text, images, video, and audio. (research.nvidia.com)

NVIDIA says Nemotron 3 Nano Omni supports up to 256,000 tokens of context, double the 128,000-token limit cited for its predecessor in the paper. The same paper says a Conv3D-based video compression method cuts temporal video tokens by 2x. (research.nvidia.com)

For users, the pitch is simpler pipelines. Instead of handing a screen recording to one model, a voice clip to another, and a document to a third, developers can run one multimodal “perception and context” model inside a larger agent system. (nvidia.com 1) (nvidia.com 2)

NVIDIA says the model tops six leaderboards spanning document intelligence, video understanding, and audio understanding, including MMLongBench-Doc, OCRBenchV2, WorldSense, DailyOmni, and VoiceBench. It also says MediaPerf tests showed the highest throughput across every task it measured. (nvidia.com) (nvidia.com)

The release builds on NVIDIA’s December 15, 2025 debut of the broader Nemotron 3 family, which introduced Nano, Super, and Ultra models for agentic AI systems. Nano Omni fills in the multimodal piece that NVIDIA had described in March as “upcoming.” (nvidia.com) (nvidia.com)

NVIDIA says the model runs with optimized inference on Ampere, Hopper, and Blackwell GPUs and is available in BF16, FP8, and NVFP4 variants. The Hugging Face card lists commercial use under the NVIDIA Open Model Agreement and says the current release is English-only. (nvidia.com) (huggingface.co)

That leaves NVIDIA with an open model family aimed less at chatbot demos than at software that watches screens, reads files, and listens to calls in one pass. The company’s argument is that fewer model handoffs can make those agents faster and cheaper to run. (nvidia.com) (nvidia.com)
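
For readers who want to sanity-check the sizing figures, here is a rough back-of-envelope sketch in Python. It assumes exactly 30 billion total and 3 billion active parameters (the "30B-A3B" label gives only rounded figures) and nominal widths of 16, 8, and 4 bits per weight for BF16, FP8, and NVFP4. It counts weights only and ignores scaling-factor overhead, activations, and the KV cache, so real memory use will be higher. The function and constant names are hypothetical and do not come from NVIDIA.

```python
# Rounded figures implied by the "30B-A3B" label; exact counts are an assumption.
TOTAL_PARAMS = 30e9
ACTIVE_PARAMS = 3e9

# Nominal bits per weight; block-scaling overhead (e.g. for NVFP4) is ignored.
BITS_PER_WEIGHT = {"BF16": 16, "FP8": 8, "NVFP4": 4}


def active_fraction(total: float, active: float) -> float:
    """Share of parameters used per token in a mixture-of-experts model."""
    return active / total


def weight_memory_gb(params: float, bits: int) -> float:
    """Weights-only memory in gigabytes (1 GB = 1e9 bytes)."""
    return params * bits / 8 / 1e9


if __name__ == "__main__":
    share = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
    print(f"active share per token: {share:.0%}")
    for name, bits in BITS_PER_WEIGHT.items():
        gb = weight_memory_gb(TOTAL_PARAMS, bits)
        print(f"{name}: roughly {gb:.0f} GB of weights")
```

Under these assumptions, about a tenth of the weights do work on each token, and each halving of precision halves the weights-only footprint. That is a rough guide to why lower-precision variants matter for local deployments, not a measured result.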