NVIDIA shows Blackwell inference wins

- NVIDIA and LMSYS spent the last week showing DeepSeek-V4 running day-one on Blackwell, with SGLang tuned around the model’s new hybrid attention design.
- The big tell was architecture fit: DeepSeek-V4 cuts per-token FLOPs 73% and KV-cache burden 90% versus V3.2, but only with stack-level caching tricks.
- That matters because inference speed is moving from raw GPU bragging rights to who best exploits Blackwell-specific kernels, quantization, and memory layout.

Inference serving is turning into a hardware-software co-design game. That’s the real news here. Blackwell GPUs are fast on paper, but the latest DeepSeek models only hit eye-popping numbers when the serving stack is rebuilt around their weird attention patterns, compressed caches, and low-precision expert weights. Over the past week, NVIDIA, LMSYS, and the vLLM ecosystem all pushed that point from different angles, basically saying the bottleneck is no longer just “buy more GPUs.” It’s whether your runtime knows how to use Blackwell properly. (developer.nvidia.com)

### What changed this week?

DeepSeek-V4 landed on April 24, 2026, and NVIDIA immediately framed it as a Blackwell story. The model family has a 1.6T-parameter Pro version and a 284B Flash version, both with 1M-token context windows. One day later, LMSYS said SGLang and Miles had day-0 inference and RL support, with systems built specifically for DeepSeek-V4’s hybrid sparse attention, manifold-constrained hyper-connections, and FP4 expert weights. (developer.nvidia.com)

### Why is DeepSeek-V4 awkward to serve?

Because its attention is not the normal “store everything, attend to everything” setup. DeepSeek-V4 mixes sliding-window attention with two compression modes, one using 4:1 top-k compression and another using 128:1 dense compression. That slashes compute and memory, but it also creates several KV pools and [...] cheaper only if the runtime can keep a very messy memory system coherent. (developer.nvidia.com)

### What did NVIDIA actually claim?

NVIDIA’s pitch was less about a single benchmark screenshot and more about why Blackwell lines up with this model family. The company said DeepSeek-V4’s architecture reduces per-token inference FLOPs by 73% and KV-cache memory by 90% versus DeepSeek-V3.2.
It also emphasized native support for FP4-style serving [...] NVIDIA is saying the next inference war is about memory traffic and context handling, not just dense-math throughput. (developer.nvidia.com)

### What did SGLang add?

SGLang filled in the runtime side. Its team described ShadowRadix prefix caching for hybrid attention, HiSparse CPU-extended KV, MTP speculative decoding with in-graph metadata, Flash Compressor, Lightning TopK, and hierarchical multi-stream overlap. Those names are a mouthful, but the pattern is simple: cache less, move [...] transformer. (lmsys.org)

### Where does vLLM fit in?

vLLM is making the same pivot from a different direction. Its current releases and recipes show Blackwell-specific defaults and optimizations becoming standard: FlashInfer MLA as a default backend on Blackwell, TRTLLM used for prefill in some paths, grouped top-k kernel fusion for MoE gains, and explicit NVFP4 decoding improvements. The DeepSeek-V3.2 Blackwell guide also makes clear that [...]class systems, not just any CUDA box. (github.com)

### Is this just benchmark theater?

Not entirely. The catch is that many of these wins are real but stack-specific. A model can look brilliant with one combination of kernels, quantization, and scheduler choices, then fall back hard on another path. That’s why so much of the recent work is open-sourcing kernels, recipes, and backend defaults. Reproducibility is becoming part of the competition. (lmsys.org)

### Why does Blackwell keep coming up?

Because Blackwell is where FP4 and memory-efficient MoE inference start to look practical at scale. DeepSeek-V4 was designed to exploit that. Blackwell, in turn, gives runtime authors new low-precision and throughput knobs to turn. It’s a feedback loop: model architects assume the hardware exists, and serving teams race to prove they can cash in on the assumption.
(developer.nvidia.com)

### Bottom line

The headline is not just that NVIDIA showed fast inference on Blackwell. It’s that open-source serving stacks are reorganizing around Blackwell-native tricks, and DeepSeek’s newest models are forcing that shift into the open. The next bragging-rights metric won’t just be tokens per second. It’ll be who can keep long-context, sparse-attention models fast without wasting memory or breaking latency. (developer.nvidia.com)
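
For a rough sense of why the cache modes mentioned earlier (sliding window, 4:1, and 128:1 compression) matter at a 1M-token context, here is a back-of-envelope sketch in Python. It assumes each ratio means one stored KV entry per 4 or 128 tokens, and that a sliding window keeps only the most recent tokens. The function name `kv_entries`, the mode labels, and the 4,096-token window are hypothetical choices for illustration, not values published by DeepSeek, NVIDIA, or SGLang.

```python
# Back-of-envelope KV entry counts per layer, per sequence.
# ASSUMPTION: the 4,096-token window is illustrative, not a DeepSeek-V4 setting.
ASSUMED_WINDOW = 4096

# ASSUMPTION: "N:1" is read as one stored entry per N tokens of context.
COMPRESSION_RATIO = {"topk_4to1": 4, "dense_128to1": 128}


def kv_entries(context_len: int, mode: str, window: int = ASSUMED_WINDOW) -> int:
    """Return a rough count of KV entries one layer keeps for one sequence."""
    if context_len < 0:
        raise ValueError("context_len must be non-negative")
    if mode == "full":
        # Classic attention: one entry per token in the context.
        return context_len
    if mode == "sliding":
        # Sliding window: only the most recent `window` tokens are kept.
        return min(context_len, window)
    if mode in COMPRESSION_RATIO:
        ratio = COMPRESSION_RATIO[mode]
        # Integer ceiling division: a partial group still needs one entry.
        return (context_len + ratio - 1) // ratio
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    context = 1_000_000
    for mode in ("full", "sliding", "topk_4to1", "dense_128to1"):
        print(f"{mode:>13}: {kv_entries(context, mode):>9,} entries")
```

These are entry counts for a single layer under the stated assumptions, so they are not meant to reproduce NVIDIA's 90% figure. Real memory use also depends on entry size, how layers are split across modes, and quantization. The sketch only shows why a runtime ends up juggling several pools of very different sizes.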

