NVIDIA Nemotron 3 Nano Omni 30B

- Nvidia on April 28 introduced Nemotron 3 Nano Omni, an open multimodal model for agents that process video, audio, images, documents and text together.
- Nvidia says the model uses a 30B-parameter hybrid mixture-of-experts design with 3B active parameters and a 256K context window, and delivers up to 9x higher throughput than other open omni models.
- The release expands Nvidia’s open Nemotron line into audio and video reasoning, with downloads on Hugging Face and a partner rollout. (nvidia.com)

Most artificial intelligence agents still stitch together separate models for seeing, listening and reading. Nvidia says its new Nemotron 3 Nano Omni folds those jobs into one open model. (nvidia.com) (developer.nvidia.com)

Nvidia unveiled Nemotron 3 Nano Omni on April 28, 2026, and said it is built for enterprise agents handling video, audio, images, documents and text in one shared context. (nvidia.com)

The model uses a 30B-A3B hybrid mixture-of-experts design, which means roughly 30 billion total parameters with about 3 billion activated for a given task. Nvidia lists a 256,000-token context window and English-language support. (nvidia.com) (huggingface.co)

A multimodal model is a system that can handle several kinds of input at once, instead of passing a screenshot to one model, audio to another and text to a third. Nvidia says that handoff process adds latency, raises cost and can break the shared context an agent needs to act. (developer.nvidia.com) (baseten.co)

Nvidia says Nemotron 3 Nano Omni tops six leaderboards spanning document intelligence, video understanding and audio understanding. In its own materials, the company cites wins on OCRBenchV2, MMlongbench-Doc, WorldSense, DailyOmni and VoiceBench, and says MediaPerf showed the highest throughput across every tested task. (nvidia.com) (developer.nvidia.com) (huggingface.co)

The company also says the model delivers up to 9x higher throughput than other open omni models with similar interactivity, and 2.9x single-stream reasoning speed on multimodal workloads. Those performance claims come from Nvidia’s launch materials and benchmark write-up. (nvidia.com) (huggingface.co)

Under the hood, Nvidia says the model combines its Nemotron 3 backbone with a C-RADIOv4-H vision encoder and a Parakeet speech encoder. The goal is to preserve visual detail in documents and screens while adding native speech transcription and audio understanding. (huggingface.co 1) (huggingface.co 2)

Nvidia and partners are pitching it as a perception layer for larger agent systems, not a standalone do-everything assistant. Nvidia’s launch post names Aible, Applied Scientific Intelligence, Eka Care, Foxconn, H Company, Palantir and Pyler as adopters, while Dell Technologies, DocuSign, Infosys, Oracle and Zefr are listed as evaluating it. (nvidia.com)

Baseten said it added day-one support on April 28 and positioned the model for production deployments on local systems, datacenters and cloud environments. Its write-up highlights the same pitch: fewer inference passes, less orchestration and a single reasoning loop across modalities. (baseten.co)

Nvidia says the model is available starting April 28 through Hugging Face, OpenRouter, build.nvidia.com and more than 25 partner platforms, with checkpoints in BF16, FP8 and NVFP4 formats. The launch turns Nemotron from an open text-heavy model family into one that now covers long-context audio and video reasoning too. (nvidia.com) (huggingface.co)
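
For a rough sense of what the 30B-A3B label and the three checkpoint formats imply, the Python sketch below estimates weight storage and the share of parameters that are active. It is a back-of-the-envelope illustration, not a figure from Nvidia: it assumes exactly 30 billion total and 3 billion active parameters, uses nominal bit widths of 16, 8 and 4 bits per weight, and ignores the scaling factors that low-precision formats store, any layers kept at higher precision, the vision and speech encoders, and runtime memory such as the KV cache. The function names are hypothetical.

```python
# Back-of-the-envelope estimate only; the constants below are assumptions
# rounded from the "30B-A3B" label, not published checkpoint sizes.
TOTAL_PARAMS = 30e9    # assumed total parameter count
ACTIVE_PARAMS = 3e9    # assumed parameters activated at a time

# Nominal bits per weight. Real low-precision checkpoints also store
# scaling factors, so actual files are somewhat larger.
BITS_PER_WEIGHT = {"BF16": 16, "FP8": 8, "NVFP4": 4}


def weight_gigabytes(params: float, bits_per_weight: int) -> float:
    """Return the approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


def active_fraction(total_params: float, active_params: float) -> float:
    """Return the share of parameters that are active in a sparse MoE model."""
    return active_params / total_params


if __name__ == "__main__":
    for name, bits in BITS_PER_WEIGHT.items():
        size = weight_gigabytes(TOTAL_PARAMS, bits)
        print(f"{name}: roughly {size:.0f} GB of weights")
    share = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
    print(f"Active share of parameters: {share:.0%}")
```

Under these assumptions, halving the bit width halves the weight footprint, which is the usual reason lower-precision checkpoints are offered for smaller or local deployments. The roughly one-in-ten active share is what lets a mixture-of-experts model do far less compute per step than a dense model of the same total size.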
