NVIDIA shows Blackwell inference wins
- NVIDIA and LMSYS spent the last week showing DeepSeek-V4 running day-one on Blackwell, with SGLang tuned around the model’s new hybrid attention design.
- The big tell was architecture fit: DeepSeek-V4 cuts per-token FLOPs 73% and KV-cache burden 90% versus V3.2, but only with stack-level caching tricks.
- That matters because inference speed is moving from raw GPU bragging rights to who best exploits Blackwell-specific kernels, quantization, and memory layout.

Inference serving is turning into a hardware-software co-design game. That’s the real news here. Blackwell GPUs are fast on paper, but the latest DeepSeek models only hit eye-popping numbers when the serving stack is rebuilt around their weird attention patterns, compressed caches, and low-precision expert weights. Over the past week, NVIDIA, LMSYS, and the vLLM ecosystem all pushed that point from different angles, basically saying the bottleneck is no longer just “buy more GPUs.” It’s whether your runtime knows how to use Blackwell properly. (developer.nvidia.com)

### What changed this week?

DeepSeek-V4 landed on April 24, 2026, and NVIDIA immediately framed it as a Blackwell story. The model family has a 1.6T-parameter Pro version and a 284B Flash version, both with 1M-token context windows. One day later, LMSYS said SGLang and Miles had day-0 inference and RL support, with systems built specifically for DeepSeek-V4’s hybrid sparse attention, manifold-constrained hyper-connections, and FP4 expert weights. (developer.nvidia.com)

### Why is DeepSeek-V4 awkward to serve?

Because its attention is not the normal “store everything, attend to everything” setup. DeepSeek-V4 mixes sliding-window attention with two compression modes, one using 4:1 top-k compression and another using 128:1 dense compression. That slashes compute and memory, but it also creates several KV pools and […] cheaper only if the runtime can keep a very messy memory system coherent. (developer.nvidia.com)

### What did NVIDIA actually claim?

NVIDIA’s pitch was less about a single benchmark screenshot and more about why Blackwell lines up with this model family. The company said DeepSeek-V4’s architecture reduces per-token inference FLOPs by 73% and KV-cache memory by 90% versus DeepSeek-V3.2.
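
To make those two percentages concrete, here is a minimal arithmetic sketch. It uses only the 73% and 90% figures quoted above; the helper names are hypothetical, and the multipliers are simple ratios, not measured throughput. Real gains depend on kernels, batching, and memory bandwidth.

```python
def remaining_fraction(reduction_pct: float) -> float:
    """Fraction of the baseline cost left after a percentage reduction."""
    if not 0 <= reduction_pct < 100:
        raise ValueError("reduction_pct must be in [0, 100)")
    return 1.0 - reduction_pct / 100.0


def capacity_multiplier(reduction_pct: float) -> float:
    """How many times more work fits in the same budget after the reduction."""
    return 1.0 / remaining_fraction(reduction_pct)


# Reduction figures as quoted in the article (DeepSeek-V4 versus DeepSeek-V3.2).
FLOPS_REDUCTION_PCT = 73.0
KV_CACHE_REDUCTION_PCT = 90.0

if __name__ == "__main__":
    flops_left = remaining_fraction(FLOPS_REDUCTION_PCT)
    kv_left = remaining_fraction(KV_CACHE_REDUCTION_PCT)

    # Per-token cost relative to a baseline normalized to 1.0.
    print(f"per-token FLOPs left: {flops_left:.2f} of baseline")
    print(f"KV-cache memory left: {kv_left:.2f} of baseline")

    # Same budget, more work: an idealized upper bound, not a benchmark.
    print(f"tokens per FLOP budget:  x{capacity_multiplier(FLOPS_REDUCTION_PCT):.1f}")
    print(f"tokens per KV-cache budget: x{capacity_multiplier(KV_CACHE_REDUCTION_PCT):.1f}")
```

Read this way, the KV-cache figure is the larger lever: a fixed memory budget holds roughly ten times the cached tokens, while the FLOPs cut works out to a bit under four times the tokens per unit of compute.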
NVIDIA also emphasized native support for FP4-style serving […]. The company is saying the next inference war is about memory traffic and context handling, not just dense-math throughput. (developer.nvidia.com)

### What did SGLang add?

SGLang filled in the runtime side. Its team described ShadowRadix prefix caching for hybrid attention, HiSparse CPU-extended KV, MTP speculative decoding with in-graph metadata, Flash Compressor, Lightning TopK, and hierarchical multi-stream overlap. Those names are a mouthful, but the pattern is simple: cache less, move […] (lmsys.org)

### Where does vLLM fit in?

vLLM is making the same pivot from a different direction. Its current releases and recipes show Blackwell-specific defaults and optimizations becoming standard: FlashInfer MLA as a default backend on Blackwell, TRTLLM used for prefill in some paths, grouped top-k kernel fusion for MoE gains, and explicit NVFP4 decoding improvements. The DeepSeek-V3.2 Blackwell guide also makes clear that […]-class systems, not just any CUDA box. (github.com)

### Is this just benchmark theater?

Not entirely. The catch is that many of these wins are real but stack-specific. A model can look brilliant with one combination of kernels, quantization, and scheduler choices, then fall back hard on another path. That’s why so much of the recent work is open-sourcing kernels, recipes, and backend defaults. Reproducibility is becoming part of the competition. (lmsys.org)

### Why does Blackwell keep coming up?

Because Blackwell is where FP4 and memory-efficient MoE inference start to look practical at scale. DeepSeek-V4 was designed to exploit that. Blackwell, in turn, gives runtime authors new low-precision and throughput knobs to turn. It’s a feedback loop: model architects assume the hardware exists, and serving teams race to prove they can cash in on that assumption.
(developer.nvidia.com, lmsys.org)

### Bottom line

The headline is not just that NVIDIA showed fast inference on Blackwell. It’s that open-source serving stacks are reorganizing around Blackwell-native tricks, and DeepSeek’s newest models are forcing that shift into the open. The next bragging-rights metric won’t just be tokens per second. It’ll be who can keep long-context, sparse-attention models fast without wasting memory or breaking latency. (developer.nvidia.com)