vLLM hits 178 tok/s on Nemotron-30B

- vLLM users benchmarking NVIDIA’s Nemotron-3-Nano-30B-A3B on DGX Spark’s GB10 are reporting about 178 tokens per second at 50K context lengths.
- The interesting part is not just speed: recent vLLM debugging shows 50K+ benchmarks can be limited by KV cache, schedulers, or even the client.
- That matters because long-context serving is becoming a systems problem, not just a raw-FLOPs problem, on Blackwell-class single-node setups.

LLM serving is starting to look less like pure math and more like traffic engineering. That’s the real story here. A vLLM setup running NVIDIA’s Nemotron-3-Nano-30B-A3B on DGX Spark hardware built around the GB10 chip has been shown hitting roughly 178 tokens per second even with a 50K-token context, which is a big enough context that old intuitions about “the GPU is the bottleneck” stop being very useful. The news is not just that the number is high. It’s that the path to that number runs through cache layout, queueing, and benchmark hygiene as much as model compute. (forums.developer.nvidia.com)

### What exactly is being served?

The model in the middle of this is NVIDIA Nemotron-3-Nano-30B-A3B, a 30B-parameter hybrid Mamba-Transformer MoE model that only activates a small slice of parameters per token, which is why it c[…] (forums.developer.nvidia.com) […] one-off lab curiosity: it is an increasingly standard path for people trying to run long-context models on compact NVIDIA systems. (docs.vllm.ai)

### Why does 50K context change the game?

At short prompts, the expensive part is usually prefill: ingesting the prompt and building the initial KV cache. At 50K tokens, you still pay that bill, but the whole system also starts stressing parts of the stack that are easier to ignore in toy demos. The cache gets huge. Memory movement starts to matter[…] (docs.vllm.ai) […] serving loop can get jammed by plumbing that has nothing to do with model quality. Basically, the GPU stops being the only story. (forums.developer.nvidia.com)

### Why are people talking about KV cache so much?

Because KV cache is the working set that makes autoregressive decoding fast, and at long context it becomes the thing you spend your life managing. One recent GB10 forum post u[…] (forums.developer.nvidia.com) […] then to 39.9 t/s on another path, by changing how decode reads paged cache and avoiding wasteful gather or memcpy behavior.
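
To put a rough number on "the cache gets huge," here is a back-of-the-envelope sizing sketch. It uses the standard per-token KV footprint for attention layers: two tensors (K and V) times layers, KV heads, head dimension, and bytes per element. The layer count, head count, and head dimension below are hypothetical placeholders, not Nemotron-3-Nano-30B-A3B's actual configuration. In a hybrid Mamba-Transformer model only the attention layers keep a KV cache, so the real figure depends on how many of those the model has.

```python
def kv_cache_bytes(seq_len, attn_layers, kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV cache size in bytes for ONE sequence.

    Counts only attention layers; each stores a K and a V vector
    per KV head per token. bytes_per_elem=2 assumes FP16/BF16.
    """
    per_token = 2 * attn_layers * kv_heads * head_dim * bytes_per_elem
    return seq_len * per_token


# Hypothetical config for illustration only (NOT the real model's values).
cfg = dict(attn_layers=8, kv_heads=8, head_dim=128)

for ctx in (2_000, 50_000):
    gib = kv_cache_bytes(ctx, **cfg) / 2**30
    print(f"{ctx:>6} tokens -> {gib:.2f} GiB per sequence")
```

The cache grows linearly with context length and with the number of concurrent sequences, which is why a 50K-token workload leans so heavily on how the cache is laid out and read.
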
That’s the shape of the problem now: less “buy more FLOPs,” more “stop moving bytes the dumb way.” (forums.developer.nvidia.com)

### Could the benchmark itself lie to you?

Yes, and it turns out this is a huge part of the current conversation. A fresh vLLM GitHub issue shows `vllm bench serve` collapsing at 50K+ context not because the server got slow, but b[…] (forums.developer.nvidia.com) […]o minutes. So any headline number around long-context serving needs one extra question attached: is this measuring the model server, or the client choking on its own stream? (github.com)

### Why does GB10 matter here?

GB10 is the Grace Blackwell chip inside DGX Spark, and this whole setup is interesting because it brings serious long-context serving into a much smaller box than the usual multi-GPU server story. NVIDIA’s own vLLM documentation now calls out Spark-specific behavior, including unified-memory OOM risk and sequence-count limits for some Nemotron variants. (github.com) […] exactly the kind of environment where scheduler and memory behavior decide whether a benchmark looks magical or broken. (docs.nvidia.com)

### So is 178 tok/s the big takeaway?

The number is the hook. The deeper takeaway is the bottleneck shift. Once you can run a 30B-class long-context model on GB10-class hardware at respectable speed, the next wins come from systems work: cache compression, paged access, smarter decode kernels, cleaner concurrency, and better benchmarking. That is […] (docs.nvidia.com) […] its “IO and scheduler” era. (forums.developer.nvidia.com)

### Bottom line?

vLLM hitting around 178 tok/s on Nemotron-30B is impressive, but the more useful lesson is where the pain moved. Long-context serving is no longer just a compute contest. It is a cache-and-queues contest now.
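
On the question of whether a long-context benchmark is measuring the server or the client, one sanity check is to timestamp each streamed token on arrival and separately time how long the client spends handling it. The sketch below is a minimal illustration using only the standard library. `StreamStats` and `summarize_stream` are hypothetical names, and this is not how `vllm bench serve` works internally.

```python
from dataclasses import dataclass


@dataclass
class StreamStats:
    ttft_s: float            # time to first token (dominated by prefill)
    decode_tok_per_s: float  # tokens/s after the first token arrived
    client_busy_frac: float  # share of wall time spent in client-side handling


def summarize_stream(request_start, arrivals, handle_times):
    """Summarize one streamed response.

    request_start: timestamp when the request was sent
    arrivals:      timestamps (same clock) when each token arrived
    handle_times:  seconds the client spent processing each token
    """
    if len(arrivals) < 2:
        raise ValueError("need at least two tokens to measure decode rate")
    decode_window = arrivals[-1] - arrivals[0]
    if decode_window <= 0:
        raise ValueError("arrival timestamps must be increasing")
    total = arrivals[-1] - request_start
    return StreamStats(
        ttft_s=arrivals[0] - request_start,
        decode_tok_per_s=(len(arrivals) - 1) / decode_window,
        client_busy_frac=sum(handle_times) / total,
    )


# Synthetic timings for illustration only (not measured data).
stats = summarize_stream(
    request_start=0.0,
    arrivals=[2.0, 2.25, 2.5, 2.75],
    handle_times=[0.25, 0.25, 0.25, 0.25],
)
print(stats)
```

Keeping time-to-first-token apart from decode rate stops a long prefill from dragging down the headline tokens-per-second figure. A high `client_busy_frac` suggests the client may be delaying its own reads, in which case the measured decode rate likely understates what the server can do.
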

