Nvidia Details Blackwell Ultra Architecture
Nvidia's Blackwell Ultra (GB300) GPU architecture features significant upgrades in memory bandwidth and advanced FP4/FP8 precision support. The platform is designed to deliver up to 50 times the tokens-per-watt efficiency for inference workloads. In a related move, Meta is not only committing to Blackwell GPUs but is also the first hyperscaler to deploy Nvidia’s Grace CPU-only servers at scale.
- The Blackwell B200 GPU contains 208 billion transistors, a significant increase from the 80 billion in the previous Hopper generation, and is built on a custom TSMC 4NP process. It provides up to 20 petaflops of FP4 horsepower and features a dedicated decompression engine that accelerates data processing up to 800 GB/s.
- A key configuration, the GB200 NVL72, connects 72 Blackwell GPUs with 36 Grace CPUs in a single liquid-cooled rack, functioning as a massive GPU. This system delivers up to 30 times faster real-time inference for large language models compared to the H100 and is designed to handle trillion-parameter models.
- Meta's adoption of standalone Grace CPUs marks a strategic shift, architecting its data centers differently for inference versus training workloads to improve cost-per-query. This move is part of a broader trend where hyperscalers are using custom and specialized silicon to optimize for specific workloads, moving beyond a one-size-fits-all GPU approach.
- While hyperscalers like Google, Amazon, and Meta are developing their own custom silicon (ASICs) for better cost and power efficiency on specific workloads, Nvidia's competitive advantage remains its comprehensive CUDA software ecosystem and the high performance of its GPUs for cutting-edge AI model training.
- The GB200 NVL72 rack has a total power consumption of approximately 120 kW, necessitating a direct liquid cooling solution. This is a substantial increase from the roughly 40 kW per rack for high-density air-cooled H100 systems.
- Meta is not only a key customer for Nvidia's latest GPUs but is also deeply invested in its own custom silicon development, including the Meta Training and Inference Accelerator (MTIA). This dual approach allows them to leverage off-the-shelf performance while building in-house solutions optimized for their specific recommendation models and future AI applications.
- The fifth-generation NVLink featured in the Blackwell architecture provides 1.8 TB/s of GPU-to-GPU interconnect bandwidth, a significant jump from the 900 GB/s in the previous generation, enabling much larger and more complex models to be trained as a single entity.
- Nvidia's strategic partnership with Meta now extends beyond GPUs to include the adoption of Spectrum-X Ethernet networking and Confidential Computing technologies for services like WhatsApp. This deeper integration signals a full-stack alignment, moving beyond just a component supplier relationship.
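
To put the generation-over-generation figures quoted above side by side, here is a minimal Python sketch of the arithmetic. It uses only the numbers stated in the bullets. The variable names, the `ratio` helper, and the per-GPU power split are illustrative assumptions and not Nvidia specifications.

```python
# Figures as quoted in the bullets above; nothing here is measured.
TRANSISTORS_B200 = 208e9      # Blackwell B200 transistor count
TRANSISTORS_HOPPER = 80e9     # previous Hopper generation
NVLINK5_GB_PER_S = 1800       # fifth-generation NVLink, 1.8 TB/s
NVLINK4_GB_PER_S = 900        # previous-generation NVLink
RACK_KW_NVL72 = 120           # GB200 NVL72 rack, liquid cooled (approximate)
RACK_KW_H100_AIR = 40         # high-density air-cooled H100 rack (approximate)
GPUS_PER_NVL72 = 72


def ratio(new: float, old: float) -> float:
    """Return how many times larger `new` is than `old`."""
    return new / old


summary = {
    "transistor_ratio": ratio(TRANSISTORS_B200, TRANSISTORS_HOPPER),
    "nvlink_ratio": ratio(NVLINK5_GB_PER_S, NVLINK4_GB_PER_S),
    "rack_power_ratio": ratio(RACK_KW_NVL72, RACK_KW_H100_AIR),
    # Assumption: dividing rack power evenly across GPUs. The rack total also
    # covers Grace CPUs, switches, and other components, so this is an upper
    # bound on each GPU's share, not a GPU power rating.
    "rack_kw_per_gpu": RACK_KW_NVL72 / GPUS_PER_NVL72,
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

On the quoted figures this works out to roughly 2.6 times the transistors, twice the NVLink bandwidth, and three times the rack power, with about 1.67 kW of rack budget per GPU slot. The last number shows why direct liquid cooling becomes necessary at this density.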