Nvidia adds orchestration to the stack

Nvidia is positioning Blackwell as a rack‑scale systems play, rolling out Mission Control software to bridge hardware and workload schedulers so large clusters run more efficiently. That move underlines how orchestration and topology‑aware scheduling are becoming part of the value proposition, not just raw chip performance. (blockchain.news)

A giant artificial intelligence cluster is no longer just a pile of fast chips. Nvidia is now selling the traffic system too, with Mission Control software that decides which jobs run on which parts of a Blackwell machine. (nvidia.com)

That matters because Nvidia’s new Blackwell systems are built like a whole rack-sized computer, not a single card you slide into a server. The GB200 NVL72 packs 72 Blackwell graphics processors into one rack and links them with 130 terabytes per second of NVLink bandwidth. (nvidia.com)

When dozens of chips work on one model, placement becomes a performance problem. If the software puts a training job on graphics processors that are farther apart in the machine, data has to take slower paths and the job finishes later. (nvidia.com)

That is what “topology-aware scheduling” means in plain English. The scheduler looks at the machine’s map (rack, node, link, and network position) and tries to keep the busiest jobs on the closest, fastest-connected hardware. (nvidia.com)

Nvidia says Mission Control sits between the hardware and the job schedulers that researchers already use. In Nvidia’s own description, it combines workload-aware intelligence with observability so jobs land on the right resources and the whole rack works “in concert” with the data center around it. (nvidia.com)

The company first introduced Mission Control for Blackwell infrastructure on March 18, 2025. Nvidia said the software was available for Nvidia DGX systems and would also come through major system makers, while using Run:ai technology for orchestration and claiming up to 5 times higher infrastructure utilization. (nvidia.com)

By April 2026, Mission Control documentation had expanded to cover GB200 and GB300 NVL72 systems as well as DGX B200 and DGX B300 systems. Nvidia’s administration guides describe it as the control layer for multi-user clusters that handles queuing, fairness, monitoring, and recovery across these larger Blackwell deployments. (docs.nvidia.com)

The hardware design explains why Nvidia keeps talking about “rack scale.” HPE’s GB200 NVL72 system page lists 72 Blackwell graphics processors, 36 Grace central processors, up to 13.5 terabytes of high-bandwidth memory, and direct liquid cooling in a 48-rack-unit cabinet. (hpe.com)

Once a machine looks like that, the old sales pitch of “our chip is faster” is too small. Nvidia is trying to own the layer that decides how shared clusters are sliced up, how failures are handled, and how training jobs and inference jobs swap places without leaving expensive hardware idle. (nvidia.com)

You can see the same shift in Nvidia’s newer software stack around Blackwell. Mission Control release notes for version 2.0.0 call out upgrades in fault tolerance, telemetry dashboards, recovery engines, and GB200 NVL72 support, which is the language of operating a factory, not just installing accelerators. (docs.nvidia.com)

The competitive angle is that scheduling software used to sit more in the background, while the chip got the spotlight. With Blackwell rack systems, Nvidia is making the scheduler, the cluster manager, and the topology map part of the product itself, so buying the hardware increasingly means buying Nvidia’s way of running the whole machine. (blockchain.news)
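
For readers who want to see the idea behind topology-aware scheduling in concrete terms, here is a minimal sketch in Python. It is not Mission Control's or Run:ai's actual algorithm, which the sources above do not describe. The `Domain` class, the `place_job` function, and the rack names are all hypothetical, and the policy is a simple assumption: keep a job inside one fast-linked domain when it fits, and otherwise spread it across as few domains as possible.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Domain:
    """Hypothetical fast-interconnect domain (for example, one rack) and its free GPUs."""
    name: str
    free_gpus: List[str] = field(default_factory=list)


def place_job(domains: List[Domain], gpus_needed: int) -> Optional[Dict[str, List[str]]]:
    """Choose GPUs for a job, preferring a single domain.

    Returns {domain name: [gpu ids]}, or None if the cluster lacks free GPUs.
    Chosen GPUs are removed from the domains' free lists.
    """
    if gpus_needed <= 0:
        raise ValueError("gpus_needed must be positive")
    if gpus_needed > sum(len(d.free_gpus) for d in domains):
        return None

    # Case 1: the job fits in one domain. Pick the tightest fit so that
    # larger domains stay whole for larger jobs.
    fitting = [d for d in domains if len(d.free_gpus) >= gpus_needed]
    if fitting:
        best = min(fitting, key=lambda d: len(d.free_gpus))
        chosen = best.free_gpus[:gpus_needed]
        del best.free_gpus[:gpus_needed]
        return {best.name: chosen}

    # Case 2: the job must span domains. Draw from the domains with the
    # most free GPUs first, which minimizes how many domains it touches.
    placement: Dict[str, List[str]] = {}
    remaining = gpus_needed
    for d in sorted(domains, key=lambda d: len(d.free_gpus), reverse=True):
        if remaining == 0:
            break
        take = min(remaining, len(d.free_gpus))
        if take == 0:
            continue
        placement[d.name] = d.free_gpus[:take]
        del d.free_gpus[:take]
        remaining -= take
    return placement


# Illustrative cluster: two made-up racks with 8 and 4 free GPUs.
cluster = [
    Domain("rack-a", [f"a{i}" for i in range(8)]),
    Domain("rack-b", [f"b{i}" for i in range(4)]),
]
print(place_job(cluster, 4))  # lands entirely in rack-b, leaving rack-a whole
print(place_job(cluster, 6))  # lands entirely in rack-a
```

A production scheduler would also weigh priorities, fairness, and failures, as the article notes. The sketch shows only the placement step, where the job's position on the machine's map is treated as part of the decision.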
