SGLang JIT kernels hit over 80% of H200 memory bandwidth in DeepSeek V4 tests

- SGLang and vLLM both shipped fresh DeepSeek V4-era inference work this week, with SGLang highlighting new JIT decode kernels and vLLM releasing v0.20.0.
- The eye-catching datapoints were SGLang’s claimed 80%+ H200 memory-bandwidth use and roughly 15-microsecond Top-K latency, plus vLLM’s DeepSeek V4 support.
- That matters because long-context serving is turning into a memory-movement problem, no longer just a matrix-multiply problem.

LLM serving is having one of those quiet-but-important shifts. The flashy part used to be prefill: cram a giant prompt through a model as fast as possible. But once you get into long conversations and long-context workloads, decode starts to dominate, and decode is a different beast. This week’s SGLang and vLLM updates make that pretty clear: the bottleneck is moving away from raw GEMM bragging rights and toward memory bandwidth, Top-K sampling, KV-cache movement, and scheduler overhead. (github.com)

### What actually changed?

SGLang has been pushing hard on DeepSeek-oriented serving paths, especially around JIT-compiled kernels and DeepSeek-specific optimizations. Its recent releases also bundled broader inference work like piecewise CUDA graphs by default, sparse-attention support, FP8 kernels, and DeepSeek MoE tuning, all signs that the project is optimizing for real serving stacks, not j[…] (github.com) […]ays ago, and one of the headline items was initial DeepSeek V4 support. (github.com)

### Why is “80% of H200 bandwidth” a big deal?

Because decode on modern GPUs is often memory-bound. In prefill, the GPU spends a lot of time doing large matrix multiplies, which is where peak tensor-core math matters most. In decode, especially at long context, the runtime keeps pulling weights and KV-cache data through memory for relatively small amounts of compute. So if a kernel is really usi[…] (github.com) […]tting unusually close to the hardware ceiling on the thing that now matters most. That is a bigger deal than a generic “X% faster” claim with no explanation. The same logic applies to a 15-microsecond Top-K path: sampling overhead becomes visible once everything else gets faster. (github.com)

### Why does DeepSeek make this harder?

DeepSeek V4 is exactly the kind of model family that stresses serving runtimes. vLLM’s release notes call out initial DeepSeek V4 support plus several DeepSeek-specific fixes, which tells you support was not just a config flip.
SGLang’s recent release notes also emphasize DeepSeek and MoE-oriented features like elastic expert parallelism and LoRA support f[…] (github.com) […]wkward territory: expert routing, token dispatch, cache pressure, and lots of opportunities for tiny inefficiencies to pile up. (github.com)

### Where does Gemma 4 fit in?

Gemma 4 is a nice control case because it is a large public model that people are actively trying to run on single high-memory GPUs. vLLM’s Gemma 4 recipe says the 31B model needs one 80 GB-class NVIDIA GPU in BF16, which puts it right in the zone where workstation-class Blackwell cards become interesting. So when people talk about roughly 20 tokens per second on an[…] (github.com) […]er. The point is that runtime overhead on single-node decode is now worth serious engineering effort. (github.com)

### Why isn’t GEMM the whole story anymore?

Because serving is a pipeline, not a benchmark chart. A fast GEMM kernel helps, but once prefill gets optimized, the next delays show up immediately: Top-K, KV reads and writes, scheduler decisions, expert dispatch, and inter-token latency. vLLM’s own metrics docs break serving into queue time, prefill time, decode time, time-to-fi[…] (github.com) […]rs themselves treat decode as a first-class performance target, not an afterthought. (docs.vllm.ai)

### What’s the catch with these numbers?

The catch is that many of the splashiest figures are still benchmark-path numbers, not universal production guarantees. SGLang’s public issue tracker already shows that deployment recipes for DeepSeek V4 Flash can trade off latency and throughput in messy ways depending on TP, DP, expert-parallel settings, and speculative decoding choices. So the right […] (docs.vllm.ai) […]ezing the memory-bound part of the stack, and that is where the next gains will come from.” (github.com)

### So what’s the bottom line?

The interesting news is not just that one kernel got faster. It is that open-source inference work is maturing into systems engineering.
SGLang’s JIT kernel push and vLLM’s DeepSeek V4 release both point the same way — long-context LLM serving is increasingly about moving data at the edge of hardware limits, then keeping the runtime from wasting those wins. (github.com)
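
To put rough numbers on the memory-bound argument above, here is a back-of-the-envelope sketch. It is an illustration, not a benchmark. The 4.8 TB/s figure is NVIDIA's published peak for the H200 SXM part and does not come from the release notes discussed here. The helper names are hypothetical. The model is deliberately simplified: at batch size 1, every generated token is assumed to stream all weights once, and KV-cache traffic, scheduling, and sampling costs are ignored. Running a dense 31B BF16 model on an H200 is only an example pairing.

```python
# Roofline-style estimate for memory-bound decode (simplified, see assumptions above).

# Assumption: NVIDIA's published peak HBM3e bandwidth for the H200 SXM, in bytes/s.
H200_PEAK_BYTES_PER_S = 4.8e12


def utilization(bytes_moved: float, seconds: float, peak: float) -> float:
    """Fraction of peak memory bandwidth a kernel sustained."""
    return (bytes_moved / seconds) / peak


def decode_tokens_per_s_ceiling(
    params: float, bytes_per_param: float, bandwidth: float
) -> float:
    """Upper bound on single-stream decode speed if weight reads dominate traffic."""
    weight_bytes = params * bytes_per_param
    return bandwidth / weight_bytes


# What "80% of H200 bandwidth" means in absolute terms: 3.84 TB/s.
sustained = 0.80 * H200_PEAK_BYTES_PER_S

# A dense 31B-parameter model in BF16 (2 bytes per parameter) is 62 GB of weights.
# Reading all of them once per token caps decode at roughly 62 tokens/s.
ceiling = decode_tokens_per_s_ceiling(31e9, 2, sustained)

print(f"sustained bandwidth: {sustained / 1e12:.2f} TB/s")
print(f"decode ceiling: {ceiling:.1f} tokens/s")
```

The takeaway is the shape of the formula rather than the exact figure: under this model the ceiling scales with sustained bandwidth and inversely with bytes read per token, so a kernel that moves from, say, 50% to 80% of peak raises the ceiling by the same ratio, with no change in FLOPs.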
