Nvidia's Blackwell Smashes AI Performance Records

Nvidia's Blackwell architecture is setting records for LLM inference, helped by a new attention algorithm called FlashAttention-4. The method improves memory efficiency and parallelism, addressing major bottlenecks in AI video models. Technical guides already show how to tune the system for peak performance, which matters for platforms that need to optimize cost and speed at scale.

The recent STAC-AI benchmark saw Nvidia's GB200 NVL72 system deliver up to 3.2 times the performance of the previous-generation Hopper architecture. When processing financial documents with Meta's Llama 3.1 models, the Blackwell system achieved 37,480 words per second, a significant leap from the 8,237 words per second of the dual GH200 systems.

Underpinning these results is the Blackwell architecture itself, built on a dual-die design featuring 208 billion transistors. The two dies are connected by a 10 TB/s link, allowing them to function as a single, unified GPU. The architecture also introduces a second-generation Transformer Engine and fifth-generation Tensor Cores that add support for new, smaller data formats like FP4 to accelerate AI computations.

FlashAttention-4 achieves its performance by expanding from a 2-stage to a 5-stage pipeline, assigning specialized roles (load, store, matrix multiply, softmax, and correction) to different groups of threads called "warps". To overcome bottlenecks, it also uses a hybrid hardware-software approach, offloading exponential calculations from the limited number of Special Function Units (SFUs) onto the more numerous general-purpose CUDA cores.

For video-centric workloads, the Blackwell platform includes a ninth-generation NVIDIA Encoder (NVENC) that adds hardware support for 4:2:2 encoding, improving quality and speed. It also features a sixth-generation decoder (NVDEC) that can double the throughput for H.264 video streams, directly addressing the data ingestion and processing requirements of high-resolution editing platforms.

Scaling is addressed by the fifth-generation NVLink, which provides 1.8 TB/s of total bandwidth per GPU. In large configurations like the GB200 NVL72, up to 72 GPUs can be interconnected, creating a 130 TB/s bandwidth domain that lets the entire rack function as a single GPU for trillion-parameter models.
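
The exponential offloading attributed to FlashAttention-4 above can be pictured with a small sketch. It computes 2**x using only multiplies, adds, and an exponent shift, which is the kind of work general-purpose cores can do without a special function unit. This is an illustration of the general technique, not the FlashAttention-4 kernel: the function names `soft_exp2` and `softmax_soft` are hypothetical, and the cubic uses plain Taylor coefficients as an assumption, where a production kernel would likely use tuned coefficients.

```python
import math

LN2 = math.log(2.0)
# Assumption: Taylor coefficients of 2**f = exp(f * ln 2), cut off after the
# cubic term. Real kernels would likely use tuned (e.g. minimax) coefficients.
C1 = LN2
C2 = LN2 ** 2 / 2.0
C3 = LN2 ** 3 / 6.0


def soft_exp2(x: float) -> float:
    """Approximate 2**x without calling a hardware exponential."""
    n = math.floor(x + 0.5)   # nearest integer to x
    f = x - n                 # remainder, in [-0.5, 0.5)
    # Horner evaluation of 1 + C1*f + C2*f^2 + C3*f^3.
    p = 1.0 + f * (C1 + f * (C2 + f * C3))
    return math.ldexp(p, n)   # p * 2**n, i.e. an exponent shift


def softmax_soft(scores):
    """Softmax built on soft_exp2, using exp(x) = 2**(x * log2(e))."""
    m = max(scores)           # subtract the max for numerical stability
    log2e = 1.0 / LN2
    weights = [soft_exp2((s - m) * log2e) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]
```

With these coefficients the relative error of `soft_exp2` stays below roughly one part in a thousand, which is why such approximations are usually paired with low-precision formats where that error is below the rounding noise.
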

This shift in where the bottlenecks lie, away from raw matrix-multiply throughput and toward memory and special-function units, is already being reflected in major frameworks. PyTorch now supports a FlashAttention-4 backend for its FlexAttention feature, allowing developers to build custom attention variants that are just-in-time (JIT) compiled to run on Blackwell and Hopper GPUs, reducing the trade-off between flexibility and performance.
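
To make "custom attention variants" concrete, here is a pure-Python sketch of the idea behind FlexAttention: the user supplies a small function that rewrites each attention score before the softmax. It does not call PyTorch and says nothing about how the FlashAttention-4 backend is selected. The helper `attention_with_score_mod` is hypothetical; only the callback's argument order, `(score, batch, head, q_idx, kv_idx)`, is meant to mirror the `score_mod` convention FlexAttention uses.

```python
import math


def attention_with_score_mod(q, k, v, score_mod):
    """Single-head attention over lists of vectors with a per-score hook."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi, q_vec in enumerate(q):
        # Scaled dot-product score against every key, passed through the hook.
        scores = []
        for ki, k_vec in enumerate(k):
            s = scale * sum(a * b for a, b in zip(q_vec, k_vec))
            scores.append(score_mod(s, 0, 0, qi, ki))  # batch=0, head=0
        # Softmax; assumes at least one score per row is finite.
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        # Weighted average of the value vectors.
        out.append([
            sum(w * v_vec[d] for w, v_vec in zip(weights, v)) / total
            for d in range(len(v[0]))
        ])
    return out


def causal(score, b, h, q_idx, kv_idx):
    """Variant: each position may only attend to itself and earlier ones."""
    return score if q_idx >= kv_idx else float("-inf")


def identity(score, b, h, q_idx, kv_idx):
    """Variant: plain, unmodified attention."""
    return score
```

Swapping `causal` for `identity` changes the attention pattern without touching the attention loop itself. In a compiled framework, that separation is what lets one optimized kernel serve many variants.
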
