Nvidia Blackwell Ultra Slashes AI Inference Costs
Nvidia has launched its Blackwell Ultra GB300 NVL72 platform, which benchmarks show delivers up to 50x higher throughput and a 35x lower cost per token compared to the previous Hopper H200 generation. The architecture is particularly effective for agentic AI workloads, with third-party tests confirming a 50x improvement in tokens-per-watt efficiency. The new platform is expected to make previously uneconomical AI applications, like persistent AI agents, viable at scale.
- The Blackwell GB200 NVL72 is a full-rack system containing 72 Blackwell GPUs and 36 Grace CPUs, interconnected by fifth-generation NVLink that provides 1.8 TB/s of GPU-to-GPU bandwidth. This high-density configuration requires liquid cooling and draws 120 kW of power, a substantial increase from the typical 10 kW per rack in conventional data centers.
- A key architectural innovation in Blackwell is its second-generation Transformer Engine, which adds support for 4-bit floating point (FP4) AI inference. Blackwell also adds a dedicated decompression engine, which speeds up database queries by up to 18 times compared to CPUs.
- While Blackwell excels at low-precision AI workloads, its design allocates less silicon to 64-bit floating-point (FP64) capabilities, meaning the previous Hopper architecture can still outperform it on certain traditional high-performance computing (HPC) tasks.
- In the competitive landscape, hyperscalers are increasingly developing their own custom ASICs to reduce reliance on Nvidia and optimize for specific workloads. Microsoft's recently announced Maia 200 chip, built on a 3nm process, claims three times the performance of Amazon's Trainium 3 and better performance-per-dollar, highlighting the intensifying build-versus-buy dynamic.
- Nvidia's primary competitive advantage extends beyond hardware to its CUDA software ecosystem, an 18-year-old platform that provides a powerful lock-in effect. In response, competitor AMD is championing an open-source approach with its ROCm platform, aiming to attract developers wary of vendor lock-in, though ROCm still lags in maturity and widespread optimization.
- The significant cost reduction in inference is not solely due to hardware; it comes from a combination of the Blackwell platform, optimized software like TensorRT-LLM, and the use of open-source models. For example, switching from a closed-source model on the Hopper platform to an open-source model on Blackwell with NVFP4 precision can cut the cost per million tokens from 20 cents down to 5 cents.
- The growth of AI is shifting enterprise budgets, with inference now accounting for approximately 65% of AI compute spending, up from just 35% previously. This trend underscores the market importance of the cost-per-token improvements delivered by platforms like Blackwell.
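
To put two of the quoted figures in perspective, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the per-million-token prices (20 cents and 5 cents) and the rack power figures (120 kW and 10 kW) come from the points above, while the function names and the one-billion-token monthly volume are hypothetical, chosen to make the numbers concrete.

```python
# Figures quoted in the article.
HOPPER_CLOSED_CENTS_PER_M = 20    # cents per million tokens, closed model on Hopper
BLACKWELL_OPEN_CENTS_PER_M = 5    # cents per million tokens, open model on Blackwell (NVFP4)
NVL72_RACK_KW = 120               # power draw of one NVL72 rack
CONVENTIONAL_RACK_KW = 10         # typical conventional data center rack

# Assumption for illustration only: a workload of one billion tokens per month.
MONTHLY_TOKENS = 1_000_000_000


def reduction_factor(before: float, after: float) -> float:
    """Return how many times smaller `after` is than `before`."""
    return before / after


def percent_saved(before: float, after: float) -> float:
    """Return the saving as a percentage of the original value."""
    return 100.0 * (before - after) / before


def cost_dollars(tokens: int, cents_per_million: float) -> float:
    """Return the dollar cost of `tokens` at a price quoted in cents per million tokens."""
    return tokens / 1_000_000 * cents_per_million / 100


if __name__ == "__main__":
    factor = reduction_factor(HOPPER_CLOSED_CENTS_PER_M, BLACKWELL_OPEN_CENTS_PER_M)
    saved = percent_saved(HOPPER_CLOSED_CENTS_PER_M, BLACKWELL_OPEN_CENTS_PER_M)
    before = cost_dollars(MONTHLY_TOKENS, HOPPER_CLOSED_CENTS_PER_M)
    after = cost_dollars(MONTHLY_TOKENS, BLACKWELL_OPEN_CENTS_PER_M)
    power = reduction_factor(NVL72_RACK_KW, CONVENTIONAL_RACK_KW)

    print(f"Cost per token falls by a factor of {factor:g} ({saved:g}% saved)")
    print(f"Monthly bill at the assumed volume: ${before:,.2f} -> ${after:,.2f}")
    print(f"An NVL72 rack draws {power:g}x the power of a conventional rack")
```

On these numbers, the model-and-platform switch in the example works out to a 4x reduction in cost per million tokens, which is well short of the "up to 35x" headline figure, and an NVL72 rack draws 12 times the power of a conventional one.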