NVIDIA Blackwell Slashes AI Inference Costs
NVIDIA's new Blackwell platform is enabling AI providers to significantly reduce the cost of large-model inference. According to the company, some providers are cutting costs by up to 10 times when running open-source models on the platform. Recent benchmarks also place the B200 family at the top of LLM inference performance.
- The Blackwell B200 GPU is built on a dual-die chiplet architecture that combines two GPU dies into a single package with 208 billion transistors, up from 80 billion in the previous Hopper generation. The two dies are connected by a 10 TB/s chip-to-chip interface, allowing them to function as one unified GPU.
- A second-generation Transformer Engine adds support for finer-grained data types, including 4-bit floating point (FP4), which NVIDIA says doubles performance for next-generation models compared to previous generations. This enables the B200 to reach up to 20 petaFLOPS of AI performance.
- The GB200 Grace Blackwell Superchip combines two B200 GPUs with a single Grace CPU via a 900 GB/s interconnect and is designed for large-scale AI systems. A full rack, the NVL72, integrates 36 of these Superchips, linking 72 Blackwell GPUs so that they act as a single massive GPU for trillion-parameter models.
- Compared to the H100, a single B200 GPU increases memory capacity from 80 GB of HBM3 to 192 GB of HBM3e and boosts memory bandwidth from 3.35 TB/s to 8 TB/s.
- The fifth-generation NVLink interconnect provides 1.8 TB/s of total bandwidth per GPU, double that of the previous generation, which speeds up communication between GPUs in multi-node setups.
- NVIDIA claims up to 30 times faster inference for some large language models compared to the H100, a figure it quotes for GB200 NVL72 systems. The B200 GPU also draws more power, with a Thermal Design Power (TDP) of 1000 W to 1200 W compared to the H100's 700 W.
- Early adopters and major cloud providers that have announced support for the Blackwell platform include Google, Meta, Microsoft, and Oracle.
- A key factor in reducing operational costs is a claimed 25x improvement in energy efficiency for inference workloads compared to the Hopper architecture, which also helps lower the total cost of ownership.
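The generational ratios implied by the figures above can be checked with a few lines of arithmetic. The following is a minimal sketch, not a benchmark: the dictionary keys and the helper names `ratios` and `nvl72_totals` are made up for illustration, the TDP uses the lower bound (1000 W) of the quoted B200 range, and the rack total counts only GPU HBM3e, not any memory attached to the Grace CPUs.

```python
# Per-GPU figures quoted in the list above.
# Key names are illustrative; B200 TDP uses the lower bound of the 1000-1200 W range.
H100 = {"memory_gb": 80, "bandwidth_tbs": 3.35, "tdp_w": 700, "transistors_b": 80}
B200 = {"memory_gb": 192, "bandwidth_tbs": 8.0, "tdp_w": 1000, "transistors_b": 208}


def ratios(new, old):
    """Return new/old for every spec present in both dicts, rounded to 2 decimals."""
    return {key: round(new[key] / old[key], 2) for key in new if key in old}


def nvl72_totals(superchips=36, gpus_per_superchip=2, memory_gb_per_gpu=192):
    """Aggregate GPU count and GPU HBM capacity for one NVL72 rack."""
    gpus = superchips * gpus_per_superchip
    return {"gpus": gpus, "hbm_gb": gpus * memory_gb_per_gpu}


if __name__ == "__main__":
    print(ratios(B200, H100))
    # {'memory_gb': 2.4, 'bandwidth_tbs': 2.39, 'tdp_w': 1.43, 'transistors_b': 2.6}
    print(nvl72_totals())
    # {'gpus': 72, 'hbm_gb': 13824}
```

On these numbers, memory capacity and bandwidth each grow by roughly 2.4x while TDP grows by about 1.4x at the lower bound, so the much larger claimed inference and efficiency gains would have to come from factors beyond raw memory specs, such as FP4 support and the NVL72's GPU-to-GPU interconnect.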