Nvidia's Blackwell GPUs Cut Inference Costs by 10x
Nvidia's new Blackwell B200 GPU delivers up to 15x faster inference than the prior-generation H100 in enterprise deployments, according to benchmarks. That gain is cutting production inference costs by as much as 10x per token. Stripe reported a 73% reduction in its own inference costs after migrating to a vLLM-powered serving infrastructure on Blackwell-class hardware.
- The Blackwell architecture is built on a custom TSMC 4NP process and packs 208 billion transistors into a dual-die design, up from 80 billion in the prior Hopper generation.
- A key performance driver is the second-generation Transformer Engine, which adds support for a new 4-bit floating-point (FP4) data format, doubling inference performance and memory efficiency compared to the 8-bit format (FP8) used in Hopper.
- To connect GPUs, fifth-generation NVLink provides 1.8 TB/s of bidirectional bandwidth per GPU, double that of the previous generation and more than 14 times the bandwidth of the current PCIe Gen 5 standard.
- This launch intensifies the "build vs. buy" dilemma for hyperscalers like Google, AWS, and Microsoft, which are developing custom ASICs (e.g., TPUs, Trainium) to optimize for specific workloads and reduce long-term costs. They must weigh that high upfront investment against continued reliance on merchant silicon from vendors like Nvidia.
- The B200 GPU is equipped with 192GB of HBM3e memory, delivering 8 TB/s of memory bandwidth to alleviate data bottlenecks when processing large models.
- The broader market reflects a strategic shift: venture capital investment in AI hardware more than doubled from 2023 to 2024 as the physical compute layer became a critical bottleneck and an area for differentiation.
- Software frameworks like vLLM are crucial for realizing these cost savings in production. vLLM's PagedAttention algorithm manages the KV cache in GPU memory in fixed-size blocks, which reduces fragmentation and enables larger batch sizes and higher throughput.
- The global AI chip market was valued at approximately $118 billion in 2024 and is projected to reach nearly $300 billion by 2030, with inference workloads expected to surpass training as the primary driver of data center compute demand.
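
The hardware figures in the list above lend themselves to a quick back-of-the-envelope check. The Python sketch below is illustrative only: the 70-billion-parameter model size is hypothetical, and the PCIe Gen 5 figure (roughly 128 GB/s bidirectional on an x16 link) is an assumption not stated in this article. It shows how much of the B200's 192GB a model's weights would occupy at FP8 versus FP4, and how the NVLink-to-PCIe ratio works out under that assumption.

```python
# Back-of-the-envelope checks on the hardware figures quoted above.
# Assumptions (NOT from the article): a hypothetical 70B-parameter model,
# and PCIe Gen 5 x16 at roughly 128 GB/s bidirectional.

HBM_CAPACITY_GB = 192        # B200 HBM3e capacity, from the article
NVLINK5_GB_PER_S = 1800      # 1.8 TB/s bidirectional per GPU, from the article
PCIE_GEN5_X16_GB_PER_S = 128 # assumed: ~64 GB/s per direction on an x16 link

BITS_PER_PARAM = {"FP8": 8, "FP4": 4}


def weight_footprint_gb(num_params: float, fmt: str) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BITS_PER_PARAM[fmt] / 8 / 1e9


def nvlink_to_pcie_ratio() -> float:
    """Ratio of NVLink 5 bandwidth to the assumed PCIe Gen 5 x16 bandwidth."""
    return NVLINK5_GB_PER_S / PCIE_GEN5_X16_GB_PER_S


if __name__ == "__main__":
    params = 70e9  # hypothetical model size
    for fmt in ("FP8", "FP4"):
        used = weight_footprint_gb(params, fmt)
        free = HBM_CAPACITY_GB - used
        print(f"{fmt}: weights {used:.0f} GB, {free:.0f} GB left for KV cache and activations")
    print(f"NVLink 5 vs PCIe Gen 5 x16: {nvlink_to_pcie_ratio():.1f}x")
```

Under these assumptions, moving from FP8 to FP4 halves the weight footprint, and the memory freed is what a serving layer such as vLLM can hand to the KV cache to support larger batches. Real deployments also need room for activations and runtime overhead, so the numbers are an upper bound on available cache space, not a sizing guide.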