NVIDIA Nemotron 3 Nano Omni 30B

- NVIDIA on April 28 released Nemotron 3 Nano Omni, an open multimodal model that handles video, audio, images, and text inside one agent pipeline.
- The key pitch is efficiency: a 30B-A3B MoE model with 256K context, native audio support, and up to 9x throughput versus comparable open stacks.
- It matters because agent builders want one fast perception model, not stitched-together vision, speech, and language systems. (blogs.nvidia.com)

NVIDIA just shipped a new kind of open model for AI agents, one that tries to be the whole sensory system instead of one more part in a pile. Nemotron 3 Nano Omni takes video, audio, images, and text in one model, with one shared context window, and NVIDIA says that cuts both latency and orchestration overhead for real agent workflows. That matters because a lot of “multimodal” systems still fake unity by chaining separate speech, vision, and language models. (blogs.nvidia.com) NVIDIA is pitching this model as the perception layer for those systems, and it released it on April 28 with weights, code, and a technical report. (blogs.nvidia.com)

### What is this thing, exactly?

Nemotron 3 Nano Omni is an omni-modal reasoning model: a model meant to read screens, documents, charts, audio, and video together, then answer in text. NVIDIA positions it as the “eyes and ears” inside a larger agent stack, not the giant planner model that does everything. The model is available through Hugging Face, OpenRouter, build.nvidia.com, and other partner platforms, which tells you this is meant for actual deployment, not just a paper drop. (blogs.nvidia.com)

### Why do people care about one model?

Because the old setup is messy. A screen recording goes to a vision model, the call goes to a speech model, the transcript goes to a language model, and then some orchestration layer tries to glue the results together. That costs time, money, and context. NVIDIA’s pitch is that a single shared perception loop is faster and keeps cross-modal details from getting lost in translation. (developer.nvidia.com, blogs.nvidia.com)

### What are the actual specs?

The backbone is a 30B-A3B hybrid mixture-of-experts model: about 31.6B total parameters, with roughly 3.2B activated per forward pass, or 3.6B including embeddings. It uses a 256K context window and adds native audio support, which is the big step up from the prior Nemotron Nano V2 VL line.
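To put the parameter counts in perspective, here is a rough back-of-the-envelope sketch in Python. It uses only the figures quoted above plus one labeled assumption: that BF16, FP8, and FP4 weights cost a nominal 16, 8, and 4 bits each. It ignores quantization scale factors, any layers kept at higher precision, the KV cache, and activations, so treat the results as lower bounds on weight storage, not as NVIDIA's published requirements.

```python
# Figures quoted in the article, in billions of parameters.
TOTAL_PARAMS_B = 31.6
ACTIVE_PARAMS_B = 3.2            # activated per forward pass, excluding embeddings
ACTIVE_WITH_EMBEDDINGS_B = 3.6   # activated per forward pass, including embeddings

# Assumption (not from the article): nominal bits per weight for each format,
# ignoring scale factors and any layers stored at higher precision.
NOMINAL_BITS_PER_WEIGHT = {"BF16": 16, "FP8": 8, "FP4": 4}


def active_fraction(active_b: float, total_b: float) -> float:
    """Share of the total parameters that a single forward pass touches."""
    return active_b / total_b


def weight_footprint_gb(total_b: float, bits_per_weight: int) -> float:
    """Weights-only storage in GB (10^9 bytes): billions of params * bytes per param."""
    return total_b * bits_per_weight / 8


if __name__ == "__main__":
    print(f"Active share: {active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B):.1%}")
    print(
        "Active share incl. embeddings: "
        f"{active_fraction(ACTIVE_WITH_EMBEDDINGS_B, TOTAL_PARAMS_B):.1%}"
    )
    for name, bits in NOMINAL_BITS_PER_WEIGHT.items():
        print(f"{name}: ~{weight_footprint_gb(TOTAL_PARAMS_B, bits):.1f} GB of weights")
```

Under these assumptions, roughly a tenth of the weights are active on any given forward pass, which is where the "A3B" in the name comes from. Note that all 31.6B parameters still have to be stored somewhere: the mixture-of-experts design reduces compute per token, not the size of the checkpoint.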
NVIDIA is releasing checkpoints in BF16, FP8, and FP4 formats, which matters for people trying to run this across different GPU budgets. (blogs.nvidia.com)

### How is it getting the speedup?

Mostly by not doing redundant work. The model uses multimodal token-reduction tricks, hardware-aware inference, optimized kernels, and efficient video sampling. There’s also Conv3D-style temporal-spatial processing for video. The basic idea is simple: shrink the number of tokens and hops before the expensive reasoning happens. That is how NVIDIA gets to the “up to 9x more efficient” claim against other open omni models with similar interactivity. (blogs.nvidia.com)

### Is this just faster, or also better?

NVIDIA says both. Its launch materials say the model tops six leaderboards and call out strong results on MMLongBench-Doc, OCRBenchV2, WorldSense, DailyOmni, and VoiceBench. NVIDIA also claims leading results in document understanding, long audio-video comprehension, and computer-use style tasks. The catch is that these are vendor-selected benchmarks, but they do line up with the product story: documents, screens, audio, and long multimodal context. (blogs.nvidia.com)

### Who is already using it?

NVIDIA says Aible, ASI, Eka Care, Foxconn, H Company, Palantir, and Pyler are already adopting it, while Dell, Docusign, Infosys, Oracle, and others are evaluating it. That does not prove deep production use yet, but it does show where NVIDIA thinks demand is coming from: enterprise agents that need to watch screens, parse docs, and listen to audio without waiting around. (blogs.nvidia.com)

### How is this different from earlier Nemotron releases?

Earlier Nemotron 3 launches focused on text-heavy reasoning models (Nano, Super, and Ultra) for agentic systems. Nano Omni adds the missing perception layer. In March, NVIDIA was still describing Nano Omni as “upcoming.” By April 28, it had become a released model with open checkpoints and a paper, which is the real shift here. (developer.nvidia.com)

### What’s the bottom line?
This is NVIDIA trying to make multimodal agents less like a relay race and more like one brain with one working memory. If the speed and cost claims hold up outside NVIDIA’s own materials, Nemotron 3 Nano Omni could become a very practical default “perception sub-agent” for enterprise AI stacks. (blogs.nvidia.com)
