DeepSeek-V3.2 Model Sets New Performance Benchmarks

DeepSeek-V3.2, running on NVIDIA GB300 (Blackwell Ultra) hardware with FP4 quantization, has posted new benchmark results. The system reached 7,360 tokens per second per GPU in prefill and 2,816 tokens per second in mixed-context scenarios, a significant gain in inference throughput.

- DeepSeek is an AI company based in Hangzhou, China, founded in July 2023 and funded by the Chinese hedge fund High-Flyer. The company has released a rapid succession of models, including specialized versions for coding and mathematics.
- The benchmark was run on the vLLM inference engine, an open-source project from UC Berkeley's Sky Computing Lab designed for high-throughput LLM serving. vLLM uses techniques like PagedAttention, which efficiently manages the GPU's key-value (KV) cache memory to maximize performance.
- FP4 quantization is a model compression technique that reduces the precision of the model's numerical weights to 4-bit floating-point numbers. This significantly lowers the memory footprint and computational demand, enabling faster inference, but makes it harder to preserve the model's original accuracy.
- The NVIDIA GB300 "Blackwell Ultra" is a rack-scale system that can integrate 72 B300 GPUs. Each GPU features 288 GB of HBM3e memory, a 50% increase over the 192 GB in the previous generation, and is specifically designed to accelerate low-precision formats like FP4.
- The Blackwell Ultra architecture delivers up to 20 petaFLOPS of FP4 sparse inference performance per GPU and features fifth-generation NVLink, providing 1.8 TB/s of interconnect bandwidth per GPU.
- An earlier predecessor, DeepSeek-V2, was a Mixture-of-Experts (MoE) model with 236 billion total parameters, of which only 21 billion were active for any given token. This architecture was designed to save on training costs and improve inference throughput compared to fully dense models.
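To make the FP4 trade-off above concrete, here is a minimal Python sketch that simulates 4-bit floating-point (E2M1) quantization with one scale factor per block and estimates weight storage at different precisions. It is an illustration under stated assumptions, not the method used in the benchmark: the function names, the block of sample values, and the single floating-point scale per block are hypothetical, and production FP4 formats differ in block size and in how scales are encoded.

```python
# Illustrative sketch only: simulated FP4 (E2M1) quantization with one scale per block.
# All names here are hypothetical; this is not vLLM or NVIDIA code.

# Non-negative magnitudes representable in the E2M1 4-bit float format.
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_block(values):
    """Scale a block so its largest magnitude maps to 6.0, then round each
    value to the nearest E2M1 magnitude. Returns (scale, codes)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / E2M1_GRID[-1] if max_abs > 0 else 1.0
    codes = []
    for v in values:
        target = abs(v) / scale
        nearest = min(E2M1_GRID, key=lambda g: abs(g - target))
        codes.append(-nearest if v < 0 else nearest)
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate values from the scale and the 4-bit codes."""
    return [scale * c for c in codes]


def weight_footprint_gb(num_params, bits_per_weight):
    """Approximate weight storage in GB (10^9 bytes), ignoring scale overhead."""
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    block = [0.12, -0.45, 0.9, -0.03, 0.6, 0.27, -0.81, 0.05]  # made-up sample weights
    scale, codes = quantize_block(block)
    restored = dequantize_block(scale, codes)
    worst = max(abs(a - b) for a, b in zip(block, restored))
    print(f"scale={scale:.4f} worst-case error={worst:.4f}")

    # DeepSeek-V2's 236 billion total parameters, as cited in the list above.
    for bits in (16, 8, 4):
        print(f"{bits}-bit weights: {weight_footprint_gb(236e9, bits):.0f} GB")
```

Under these assumptions, 236 billion parameters take roughly 472 GB at 16 bits and 118 GB at 4 bits. Only the 4-bit figure fits within a single GPU's 288 GB of memory, before accounting for the KV cache and activations. The rounding error printed by the sketch shows the accuracy cost: with only eight magnitudes available per sign, small weights in a block can collapse to zero.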
