Nvidia Claims 35x Drop in AI Inference Costs

Nvidia announced that its new GB300 NVL72 systems can reduce inference costs for AI workloads by up to 35x while increasing throughput per watt by 50x. The company touts the Blackwell architecture as a catalyst for the mass adoption of real-time agentic AI. The cost reduction is expected to shift the economics of enterprise AI deployments and to improve their performance.

- The GB200 NVL72 rack-scale system combines 72 Blackwell GPUs and 36 Grace CPUs, functioning as a single massive GPU. It is a liquid-cooled system designed for real-time inference on trillion-parameter Large Language Models (LLMs). The total power consumption for a rack is 120 kW.
- A key architectural change from the predecessor, Hopper, is a shift in focus from mixed high-performance computing (HPC) and AI workloads to a design explicitly optimized for large-scale generative AI. Blackwell doubles down on low-precision AI math (FP8/FP4) while sacrificing some of the double-precision (FP64) performance that was a focus for Hopper.
- The Blackwell B200 GPU chip contains 208 billion transistors, a significant increase from the 80 billion in the Hopper H100. This is achieved using a custom 4NP TSMC process. This allows for a 30x performance increase in LLM inference workloads compared to the H100.
- A second-generation Transformer Engine and new 4-bit floating point (FP4) inference capabilities contribute to the cost reduction, allowing the system to double the compute and model sizes it can support. This new engine, combined with fifth-generation NVLink, enables up to 4x faster training for LLMs compared to the H100.
- The fifth-generation NVLink interconnect provides 1.8TB/s of bidirectional throughput per GPU, allowing for high-speed communication between up to 576 GPUs. This is a core component of the system's ability to handle trillion-parameter models by minimizing communication bottlenecks that can hinder performance at scale.
- The high upfront cost of AI hardware is a primary barrier to enterprise adoption, with 33% of IT executives citing it as a major concern. A complete 8-GPU server system with previous generation H100s can cost between $200,000 and $400,000, and personnel costs for managing AI systems can add hundreds of thousands annually.
- Beyond hardware, hidden costs in enterprise AI deployments include data engineering, which can consume 25-40% of the total budget, and the need for constant model retraining. Integrating with legacy systems can also inflate AI project costs by 40-60%.
- The economics of running AI models in production, known as inference cost, is a critical factor for sustainable deployment and is often priced per million tokens processed. These costs have been declining rapidly, with some benchmarks showing a 10x year-over-year price drop for equivalent performance.
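To make the per-token and low-precision figures above more concrete, here is a rough back-of-envelope sketch in Python. It only does arithmetic on numbers quoted in this briefing (trillion-parameter models, FP4, "up to 35x", "10x year-over-year"). The baseline price of $10 per million tokens and all function names are hypothetical, chosen purely for illustration, and the memory estimate covers model weights only, ignoring activations, KV cache, and other overhead.

```python
# Back-of-envelope arithmetic for figures quoted in the briefing.
# The baseline price and all function names are hypothetical illustrations.

BITS_PER_BYTE = 8


def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory for model weights only, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / BITS_PER_BYTE / 1e9


def reduced_cost(baseline_usd_per_million: float, reduction_factor: float) -> float:
    """Price per million tokens after dividing by a claimed reduction factor."""
    return baseline_usd_per_million / reduction_factor


def cost_after_years(start_usd_per_million: float, yearly_drop: float, years: int) -> float:
    """Price per million tokens if it falls by `yearly_drop`x every year."""
    return start_usd_per_million / (yearly_drop ** years)


if __name__ == "__main__":
    params = 1e12  # a trillion-parameter model, as mentioned in the briefing

    # Halving the bits per weight halves the memory needed for the weights.
    for name, bits in [("FP16", 16), ("FP8", 8), ("FP4", 4)]:
        print(f"{name}: {weight_memory_gb(params, bits):,.0f} GB of weights")

    baseline = 10.0  # hypothetical USD per million tokens, NOT from the source
    best_case = reduced_cost(baseline, 35)  # "up to 35x" is an upper bound
    print(f"Best case after a 35x reduction: ${best_case:.3f} per million tokens")

    # The "10x year-over-year" trend cited for some benchmarks, extrapolated.
    for year in range(3):
        price = cost_after_years(baseline, 10, year)
        print(f"Year {year}: ${price:.2f} per million tokens")
```

Because "up to 35x" is a best-case vendor claim, the `reduced_cost` result is better read as a floor on price than as an expected value for any particular workload.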

