Nvidia Blackwell Ultra Reaches FP64/FP32 Parity

Nvidia's Blackwell Ultra GPUs have eliminated the long-standing performance gap between double-precision (FP64) and single-precision (FP32) compute for the first time in over 15 years. The company's data center business reported $39.1 billion in sales for Q1 FY2026, a 69% year-over-year increase, driven by the new architecture.

- The performance gap was a deliberate market segmentation strategy by Nvidia that began around 2010 with the Fermi architecture. On consumer-grade GeForce cards, the FP64-to-FP32 performance ratio was artificially limited, declining from 1:8 on Fermi to 1:64 on the Ampere architecture, pushing scientific and HPC customers toward more expensive Tesla and Quadro cards.
- The Blackwell B200 GPU features 208 billion transistors and offers up to 4.5 times the training performance of the previous H100 generation. It provides 192 GB of HBM3e memory with 8 TB/s of bandwidth, a significant increase from the H100's 80 GB of HBM3 and 3.35 TB/s of bandwidth.
- Hyperscalers are increasingly designing their own custom silicon to reduce costs and optimize for specific workloads, directly competing with general-purpose GPUs. Notable examples include Amazon's Trainium and Inferentia chips, Google's Tensor Processing Units (TPUs), and Meta's "Artemis" AI chip.
- This "build vs. buy" trend extends to partnerships, such as OpenAI's multi-billion dollar deal with Broadcom to develop custom AI chips. Broadcom currently holds a reported $73 billion backlog for AI-related custom silicon, indicating massive investment from large cloud and AI companies.
- The Blackwell architecture introduces new, lower-precision formats like FP4, which can double the effective model size and throughput for inference workloads. However, some Blackwell models, like the B300 Ultra, significantly reduce FP64 performance to 1.2 TFLOPS to maximize low-precision AI throughput, a trade-off compared to the 34 TFLOPS on the prior H200 "Hopper" GPU.
- Venture capital is aggressively funding startups aiming to disrupt the AI hardware space. Ricursive Intelligence, founded by former Google researchers to use AI for designing chips, raised $335 million at a $4 billion valuation within months of its launch.
- Power consumption and data center infrastructure are becoming key differentiators; the Blackwell B200 has a Thermal Design Power (TDP) of 1000W, up from the H100's 700W, increasing the need for advanced liquid cooling solutions in data centers.
- Nvidia's strategy also includes integrated "Superchips" like the Grace Hopper (GH200), which combines an Arm-based CPU with a Hopper GPU via a high-speed 900 GB/s NVLink-C2C interconnect. This design creates a unified memory pool to reduce data transfer bottlenecks between the CPU and GPU.
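To put the generational figures quoted above side by side, the ratios can be worked out in a few lines of Python. This is only an illustrative sketch: the numbers are copied from the briefing as reported and are not independently verified, and the names `SPECS` and `generational_ratios` are made up for this example.

```python
# Figures as quoted in the briefing; treat them as reported, not verified.
SPECS = {
    "H100": {"memory_gb": 80, "bandwidth_tbs": 3.35, "tdp_w": 700},
    "B200": {"memory_gb": 192, "bandwidth_tbs": 8.0, "tdp_w": 1000},
}


def generational_ratios(old, new):
    """Return new/old for every spec key present in the older part."""
    return {key: new[key] / old[key] for key in old}


ratios = generational_ratios(SPECS["H100"], SPECS["B200"])
for name, value in ratios.items():
    print(f"{name}: {value:.2f}x")
# memory_gb: 2.40x
# bandwidth_tbs: 2.39x
# tdp_w: 1.43x

# FP64 trade-off quoted for the B300 Ultra (1.2 TFLOPS) vs. the H200 (34 TFLOPS).
fp64_drop = 34 / 1.2
print(f"FP64 reduction: about {fp64_drop:.1f}x")
# FP64 reduction: about 28.3x
```

On these reported numbers, memory capacity and bandwidth both grow by roughly 2.4x while TDP grows by about 1.4x, and the quoted FP64 figure for the B300 Ultra is roughly 28 times lower than the H200's.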
