SGLang’s speculative decoding nearly doubles decoding speed at batch size 1
- SGLang’s latest docs and model guides show speculative decoding is now a first-class low-latency feature, with DeepSeek V3/R1 seeing major batch-size-1 gains.
- The clearest number is 1.8x faster decoding at batch size 1 on an H200 TP8 setup, with defaults like 5 speculative steps and overlap scheduling.
- That matters because single-request inference is increasingly memory-bound, so winning now means verifying more tokens per pass and moving KV cache smarter.
LLM serving is hitting a weird limit. The hard part is no longer just raw math on the GPU. It’s moving model weights and KV cache around fast enough that one-token-at-a-time decoding doesn’t waste the hardware. That’s why SGLang’s speculative decoding work matters right now: it shows that even at batch size 1, where latency is the whole game, you can get close to a 2x speedup by predicting several tokens ahead and then checking them in one verification pass.

### What is the basic trick?

Speculative decoding splits generation into two jobs. A draft path guesses the next few tokens, and the full model verifies those guesses together in a single pass. If the guesses are right, the system skips several expensive full decode passes. SGLang supports several versions of this idea: EAGLE-2, EAGLE-3, MTP, DFLASH, standalone draft-model decoding, and even an ngram mode with no extra model. (docs.sglang.io)

### Why is batch size 1 the hard case?

At bigger batch sizes, you can hide inefficiency by keeping the GPU busy with many requests. Batch size 1 is harsher. One user is waiting on one stream of tokens, so every decode step shows up directly in latency. SGLang’s DeepSeek docs are notable because they claim a 1.8x decoding speedup at batch size 1 on H200 with tensor parallelism 8. That is not just a throughput win in a packed server, but a low-latency win for a single request. (docs.sglang.io)

### Why can checking several tokens be cheap?

Because decode is often memory-bound, not compute-bound. In plain English, the GPU spends a lot of time pulling weights and cache from memory rather than doing arithmetic. If that’s your bottleneck, verifying multiple drafted tokens in one pass can cost surprisingly close to decoding a single token. SGLang’s own speculative-decoding docs lean on exactly this logic, and its SpecForge docs say the same thing even more directly. It’s like paying the toll once and sending several cars through together. (docs.sglang.io)

### What specific settings matter here?

The important knobs are how far ahead to draft, how many branches to try, and how many draft tokens to verify at once. In SGLang’s current docs, `speculative_num_steps` defaults to 5, and the experimental overlap scheduler, enabled with `SGLANG_ENABLE_SPEC_V2=1`, is designed to overlap draft and verification work for better performance. There’s also a strong hint in the docs and issues that topk=1 chain-style decoding is the path they’re optimizing hardest for overlap mode. (sgl-project.github.io)

### Is this one method or a whole toolbox?

Basically a toolbox. If the model already has multi-token prediction heads, SGLang can use MTP. If you have a separate draft model, it can run EAGLE or standalone speculative decoding. If you have no draft model at all, it can fall back to ngram speculation. The docs currently recommend EAGLE-3 for best speed and quality overall, while model-specific guides like DeepSeek expose MTP-based acceleration. (github.com)

### What’s the catch?

Acceptance rate. The more tokens you guess ahead, the more you risk rejection cascades, where bad guesses force extra work and erase the win. SGLang’s newer adaptive speculative decoding feature exists for exactly this reason: it watches accepted draft length over time and switches among candidate step tiers like 1, 3, and 7 instead of assuming one fixed setting is always best. (docs.sglang.io)

### Why does this matter beyond SGLang?

Because it changes what “faster inference” means. A lot of the next gains won’t come from just larger GPUs or more FLOPS. They’ll come from systems tricks that reduce memory traffic, overlap stages, and make KV-cache handling less wasteful. SGLang’s batch-size-1 numbers are a pretty clean signal that low-latency serving is now a memory-and-scheduling problem as much as a model problem. (sgl-project.github.io)

### Bottom line?
The headline is simple: SGLang is showing that speculative decoding can meaningfully speed up single-request generation, not just bulk throughput. But the deeper story is that the win comes from respecting the real bottleneck. In modern LLM serving, the fastest path is often the one that touches memory fewer times. (docs.sglang.io)
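
To make the acceptance-rate trade-off concrete, here is a rough back-of-the-envelope model in Python. It is a sketch under stated assumptions, not SGLang code: the function names are hypothetical, each drafted token is assumed to be accepted independently with the same probability (real acceptance is correlated and prompt-dependent), drafting is chain-style (topk=1), and the `draft_cost` and `verify_cost` values are made-up numbers expressed in units of one ordinary decode pass. The `verify_cost` default sits just above 1.0 to reflect the memory-bound argument that checking several tokens costs about as much as decoding one.

```python
def expected_tokens_per_pass(accept_prob: float, num_steps: int) -> float:
    """Expected tokens emitted per verification pass with chain drafting.

    Assumes every drafted token is accepted independently with probability
    `accept_prob`, and that the target model always contributes one token of
    its own (the correction after a rejection, or a bonus token when all
    drafts are accepted).
    """
    if not 0.0 <= accept_prob <= 1.0:
        raise ValueError("accept_prob must be in [0, 1]")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    # P(first i drafts all accepted) = accept_prob ** i; the i = 0 term is
    # the token the target model always produces.
    return sum(accept_prob ** i for i in range(num_steps + 1))


def estimated_speedup(
    accept_prob: float,
    num_steps: int,
    draft_cost: float = 0.05,   # hypothetical cost of one draft step
    verify_cost: float = 1.1,   # hypothetical cost of one verification pass
) -> float:
    """Tokens per unit time relative to plain decoding (1 token per pass)."""
    cost_per_round = verify_cost + num_steps * draft_cost
    return expected_tokens_per_pass(accept_prob, num_steps) / cost_per_round


if __name__ == "__main__":
    for accept_prob in (0.3, 0.6, 0.8):
        for num_steps in (1, 3, 5, 7):
            speedup = estimated_speedup(accept_prob, num_steps)
            print(f"p={accept_prob:.1f} steps={num_steps} speedup={speedup:.2f}x")
```

Under these toy numbers, a high acceptance rate rewards drafting further ahead, while a low one makes long drafts slower than plain decoding. That is the intuition behind picking the step count adaptively rather than fixing it.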