Blackwell Slashes Llama 3 Training Time

Nvidia's new Blackwell GPUs completed the pretraining benchmark for Meta's Llama 3.1 405B model in just 27 minutes, according to new MLPerf results from MLCommons. The run used 2,496 Blackwell chips, fewer than half the GPU count the previous H100 generation needed. The gain in training efficiency reinforces Nvidia's dominance in the large-scale model training market.

The Blackwell B200 GPU is built on a dual-die chiplet design that packs 208 billion transistors and is manufactured on a custom TSMC 4NP process. The architecture delivers up to 20 petaflops of FP4 compute, a 5x improvement over the prior Hopper-generation H100. The design's focus has shifted towards post-training inference efficiency, a significant change from previous generations that prioritized raw training FLOPS.

A key innovation in Blackwell is the second-generation Transformer Engine, which adds support for FP4 precision. Compared with FP8, this halves the memory required per parameter, allowing a 70-billion-parameter model to run on a single B200 GPU. This efficiency gain, combined with a new hardware decompression engine, targets major bottlenecks in data loading and processing.

The MLPerf benchmarks show the B200 delivering up to 2.2 times the training performance of the H100 on large language models like GPT-3 175B. For inference, the gains are even more substantial, with claims of up to 4x faster performance than the H100, partly due to the new FP4 support. The jump is also enabled by a significant boost in memory: 192 GB of HBM3e delivering 8 TB/s of bandwidth.

This performance jump directly affects the "build vs. buy" calculus for hyperscalers like Google, Amazon, and Meta. These giants invest heavily in custom silicon, such as Google's TPU and Broadcom-designed ASICs, for workload-specific optimization and cost savings, but Nvidia's generational performance gains keep pressure on those internal efforts. The high cost and long development cycles of custom chips mean they risk being outdated by the time they launch. The competitive landscape is therefore a race between the flexibility and ecosystem of general-purpose GPUs and the potential long-term cost and efficiency benefits of custom ASICs.

While Nvidia currently dominates the training market, the battle is intensifying in the inference space, which is projected to account for the majority of AI compute spending by 2026. Competitors like Broadcom are gaining traction by co-designing custom chips for major AI players, creating a more diversified market.
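
The FP4 memory claim can be checked with simple arithmetic. The sketch below is a rough, weights-only estimate and not Nvidia's own sizing method: it assumes each parameter takes exactly the nominal bit width of its format and ignores the KV cache, activations, and any per-block scaling factors a real FP4 deployment would add. The function names are hypothetical, and the 192 GB capacity is the B200 figure quoted above.

```python
# Weights-only memory estimate for a model stored at different precisions.
# Assumption: every parameter occupies exactly the nominal bit width;
# KV cache, activations, and quantization scale factors are ignored.

BITS_PER_PARAM = {"FP16": 16, "FP8": 8, "FP4": 4}
B200_HBM_GB = 192  # HBM3e capacity quoted in the article


def weight_memory_gb(num_params: int, precision: str) -> float:
    """Return the weight footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    bits = BITS_PER_PARAM[precision]
    return num_params * bits / 8 / 1e9


def fits_on_one_gpu(num_params: int, precision: str,
                    capacity_gb: float = B200_HBM_GB) -> bool:
    """True if the weights alone fit within the given memory capacity."""
    return weight_memory_gb(num_params, precision) <= capacity_gb


if __name__ == "__main__":
    models = {"70B": 70_000_000_000, "405B": 405_000_000_000}
    for name, params in models.items():
        for precision in BITS_PER_PARAM:
            gb = weight_memory_gb(params, precision)
            verdict = "fits" if fits_on_one_gpu(params, precision) else "does not fit"
            print(f"{name} @ {precision}: {gb:.1f} GB ({verdict} in {B200_HBM_GB} GB)")
```

Under these assumptions, a 70-billion-parameter model needs about 35 GB for its weights at FP4, half of the 70 GB it needs at FP8. That leaves most of a single B200's memory free for the working data a real inference workload also requires.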
