Nvidia Blackwell Ultra Slashes AI Inference Costs

Nvidia has launched its Blackwell Ultra GB300 NVL72 platform, which benchmarks show delivers up to 50x higher throughput and a 35x lower cost per token compared to the previous Hopper H200 generation. The architecture is particularly effective for agentic AI workloads, with third-party tests confirming a 50x improvement in tokens-per-watt efficiency. The new platform is expected to make previously uneconomical AI applications, like persistent AI agents, viable at scale.

- The Blackwell GB200 NVL72 is a full-rack system containing 72 Blackwell GPUs and 36 Grace CPUs, interconnected by fifth-generation NVLink that provides 1.8 TB/s of GPU-to-GPU bandwidth. This high-density configuration requires liquid cooling and draws 120 kW of power, a substantial increase from the typical 10 kW per rack in conventional data centers.
- A key architectural innovation in Blackwell is its second-generation Transformer Engine, which adds support for 4-bit floating point (FP4) AI inference. A new dedicated decompression engine also allows Blackwell to speed up database queries by up to 18 times compared to CPUs.
- While Blackwell excels at low-precision AI workloads, its design allocates less silicon to 64-bit floating-point (FP64) capabilities, meaning the previous Hopper architecture can still outperform it on certain traditional high-performance computing (HPC) tasks.
- In the competitive landscape, hyperscalers are increasingly developing their own custom ASICs to reduce reliance on Nvidia and optimize for specific workloads. Microsoft's recently announced Maia 200 chip, built on a 3nm process, claims three times the performance of Amazon's Trainium 3 and better performance-per-dollar, highlighting the intensifying build-versus-buy dynamic.
- Nvidia's primary competitive advantage extends beyond hardware to its CUDA software ecosystem, an 18-year-old platform that provides a powerful lock-in effect. In response, competitor AMD is championing an open-source approach with its ROCm platform, aiming to attract developers wary of vendor lock-in, though ROCm still lags in maturity and widespread optimization.
- The significant cost reduction in inference is not solely due to hardware; it comes from a combination of the Blackwell platform, optimized software like TensorRT-LLM, and the use of open-source models. For example, switching from a closed-source model on the Hopper platform to an open-source model on Blackwell with NVFP4 precision can cut the cost per million tokens from 20 cents down to 5 cents.
- The growth of AI is shifting enterprise budgets, with inference now accounting for approximately 65% of AI compute spending, up from just 35% previously. This trend underscores the market importance of the cost-per-token improvements delivered by platforms like Blackwell.
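
To put the per-token figures above in concrete terms, the short Python sketch below works through the arithmetic. The 20-cent and 5-cent prices come from the example above; the monthly token volume and the function names are hypothetical, chosen only for illustration, and real deployments would see different numbers.

```python
# Illustrative arithmetic only. Prices per million tokens are taken from the
# example in the text; the workload size below is an assumed, made-up figure.

HOPPER_CLOSED_USD_PER_M = 0.20         # closed-source model on Hopper
BLACKWELL_OPEN_NVFP4_USD_PER_M = 0.05  # open-source model on Blackwell with NVFP4


def monthly_cost_usd(tokens_per_month: float, usd_per_million_tokens: float) -> float:
    """Cost of serving a given token volume at a given price per million tokens."""
    return tokens_per_month / 1_000_000 * usd_per_million_tokens


def cost_reduction_factor(before_usd_per_m: float, after_usd_per_m: float) -> float:
    """How many times cheaper the new price is than the old one."""
    return before_usd_per_m / after_usd_per_m


if __name__ == "__main__":
    tokens = 50_000_000_000  # hypothetical: 50 billion tokens per month
    before = monthly_cost_usd(tokens, HOPPER_CLOSED_USD_PER_M)
    after = monthly_cost_usd(tokens, BLACKWELL_OPEN_NVFP4_USD_PER_M)
    factor = cost_reduction_factor(HOPPER_CLOSED_USD_PER_M, BLACKWELL_OPEN_NVFP4_USD_PER_M)
    print(f"Before: ${before:,.2f} per month")
    print(f"After:  ${after:,.2f} per month")
    print(f"Reduction factor: {factor:.1f}x")
```

Note that this particular switch works out to a fourfold reduction, which reflects the combined change of model, precision, and platform in that one example rather than the headline hardware comparison.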
