NVIDIA Blackwell Slashes AI Inference Costs

NVIDIA's new Blackwell platform is letting AI providers significantly reduce the cost of large-model inference. According to the company, some providers have cut costs by up to 10 times by running open-source models on the platform. Recent benchmarks also show the B200 family leading in LLM inference performance.

- The Blackwell B200 GPU is built on a dual-die chiplet architecture, combining two GPU dies into a single package with 208 billion transistors, a significant increase from the 80 billion in the previous Hopper generation. This design is connected by a 10 TB/s chip-to-chip interface, allowing it to function as one unified GPU.
- A new second-generation Transformer Engine introduces support for finer-grain data types, including 4-bit floating point (FP4) AI, which doubles the performance for next-generation models compared to previous generations. This enables the B200 to achieve up to 20 petaFLOPS of AI performance.
- The GB200 Grace Blackwell Superchip combines two B200 GPUs with a single Grace CPU via a 900 GB/s interconnect, designed for large-scale AI systems. A full rack, the NVL72, integrates 36 of these Superchips, linking 72 Blackwell GPUs to act as a single massive GPU for trillion-parameter models.
- Compared to the H100, a single B200 GPU increases memory capacity from 80GB of HBM3 to 192GB of HBM3e and boosts memory bandwidth from 3.35 TB/s to 8 TB/s.
- The fifth-generation NVLink interconnect provides 1.8 TB/s of total bandwidth per GPU, which is double the bandwidth of the previous generation, facilitating faster communication between GPUs in multi-node setups.
- While delivering up to 30 times faster inference for some large language models compared to the H100, the B200 GPU has a higher power consumption, with a Thermal Design Power (TDP) of 1000W-1200W compared to the H100's 700W.
- Early adopters and major cloud providers that have announced support for the Blackwell platform include Google, Meta, Microsoft, and Oracle.
- A key factor in reducing operational costs is a claimed 25x improvement in energy efficiency for inference workloads compared to the Hopper architecture, which also helps to lower the total cost of ownership.
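
To put the figures in the list above side by side, here is a minimal Python sketch that computes the generation-over-generation ratios and a rough speedup-per-watt estimate. The function names (`spec_ratios`, `perf_per_tdp_watt`) are hypothetical, and the inputs are only the numbers quoted above. The per-watt figure rests on two loud assumptions: that the "up to 30 times" speedup applies, and that both GPUs draw exactly their TDP. TDP is a design ceiling rather than measured energy use, so this is a back-of-envelope comparison, not a reproduction of NVIDIA's 25x energy-efficiency claim.

```python
# Figures quoted in the list above; nothing here is measured independently.
H100 = {"transistors_bn": 80, "memory_gb": 80, "bandwidth_tbs": 3.35, "tdp_w": 700}
B200 = {"transistors_bn": 208, "memory_gb": 192, "bandwidth_tbs": 8.0}
B200_TDP_RANGE_W = (1000, 1200)  # quoted TDP range for the B200


def spec_ratios():
    """Return B200-to-H100 ratios for the quoted hardware specs."""
    return {
        "transistors": B200["transistors_bn"] / H100["transistors_bn"],
        "memory_capacity": B200["memory_gb"] / H100["memory_gb"],
        "memory_bandwidth": B200["bandwidth_tbs"] / H100["bandwidth_tbs"],
    }


def perf_per_tdp_watt(speedup, new_tdp_w, old_tdp_w=H100["tdp_w"]):
    """Rough speedup per watt, ASSUMING both GPUs run at exactly their TDP."""
    return speedup * old_tdp_w / new_tdp_w


if __name__ == "__main__":
    for name, value in spec_ratios().items():
        print(f"{name}: {value:.2f}x")
    # prints 2.60x, 2.40x and 2.39x respectively

    for tdp in B200_TDP_RANGE_W:
        print(f"30x speedup at {tdp} W: {perf_per_tdp_watt(30, tdp):.1f}x per TDP watt")
    # prints 21.0x at 1000 W and 17.5x at 1200 W
```

Under these assumptions the best-case speedup works out to roughly 17.5x to 21x per TDP watt, which is in the same range as the claimed 25x efficiency gain but not identical to it; the gap is a reminder that the two headline numbers likely come from different workloads and measurement setups.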
