Adam Rosler demos continuous batching
- Adam Rosler walked through continuous batching, the LLM serving trick behind ORCA and vLLM, showing how finished requests free GPU slots immediately.
- The key mechanic is iteration-level scheduling: when one sequence hits an end token, the server swaps in another instead of waiting.
- That matters because serving cost is often an infrastructure problem, not just a model problem.

LLM serving is full of fake bottlenecks. People talk about model size, quantization, and kernels, and those matter, but a lot of waste comes from the queue. Adam Rosler’s demo is about that layer. He walked through continuous batching, the scheduling trick popularized by ORCA and now built into systems like vLLM, and the point is simple: stop making the GPU wait for the slowest request.

### What is continuous batching?

A normal static batch acts like a bus route. A fixed group of requests gets loaded together, the model runs token by token, and everyone stays on until the longest request is done. That sounds efficient, but real generations have wildly different lengths. Some requests finish early, hit an end token, and then just sit there occupying space while the batch drags on. Continuous batching fixes that by changing the batch at every decode step instead of only between whole requests. (usenix.org)

### Why does static batching waste so much?

Autoregressive models generate one token per forward pass. That means a request that needs 20 output tokens and a request that needs 500 are not remotely the same job. In a static batch, the short one can finish quickly but still hold memory and scheduling position until the long one exits. You get padding-like waste, idle slots, and lower throughput even though the GPU looks “busy.” That mismatch is exactly what ORCA was designed around. (usenix.org)

### What did ORCA actually add?

ORCA’s core idea was iteration-level scheduling. Instead of scheduling work per request, it schedules per generation step. The engine runs one iteration for the active set, checks who finished, frees those slots, and admits new waiting requests before the next iteration. ORCA also paired that with selective batching: operations such as the linear layers are batched across requests, while attention, which depends on each request's own sequence length, is handled per request. (usenix.org)

### Where does vLLM fit in?

vLLM took these ideas and made them practical for mainstream serving.
Its stack combines continuous batching with PagedAttention, which manages KV-cache memory more efficiently, so the system can keep more useful work in flight without fragmenting memory as badly. That combination is why vLLM became a default answer when teams started asking how to serve open models at high throughput without lighting money on fire. (usenix.org)

### Are the speedup numbers real?

Yes, but they depend on the baseline. ORCA’s paper reported a 36.9x throughput gain over NVIDIA FasterTransformer at the same latency target on GPT-3 175B serving. Later production-style benchmarks around vLLM showed up to 23x throughput improvement, and about 8x over naive batching from continuous batching alone in some setups. So when someone demos a dramatic win, the right reaction is not “impossible.” It’s “compared with what?” (nm-vllm.readthedocs.io)

### Why do people describe this as slot-swapping?

Because that’s basically what it feels like operationally. One request emits its stop token, its KV-cache space and scheduler slot get reclaimed, and another waiting request takes its place on the next step. The batch becomes a rolling set of active sequences rather than a sealed container. That is the mental model Rosler’s demo is trying to make intuitive. (usenix.org)

### Why does this matter beyond one demo?

Because serving economics are dominated by utilization. If your GPUs spend large chunks of decode time carrying dead weight, you need more hardware for the same user load. Continuous batching changes that math. It can raise throughput, reduce queueing, and sometimes improve latency at the same time, which is rare in systems work. Basically, it turns “we need more GPUs” into “we need a better scheduler” more often than teams expect. (usenix.org)

### Bottom line?

Rosler’s demo matters because it points at a boring-looking systems trick that has outsized impact. The model did not get smarter. The GPU did not get bigger. The server just stopped waiting around.
And in LLM infrastructure, that can be the difference between a cool prototype and a business that pencils out. (usenix.org) (anyscale.com)
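
To make the slot-swapping mechanic concrete, here is a toy simulation that counts decode steps under static batching and under iteration-level scheduling. It is a minimal sketch, not ORCA's or vLLM's actual scheduler: the function names are hypothetical, each request is reduced to its output length (assumed positive), every decode step is assumed to cost the same, and prefill and KV-cache memory limits are ignored.

```python
from collections import deque


def static_batching_steps(output_lengths, batch_size):
    """Decode steps when each fixed batch runs until its longest request ends."""
    steps = 0
    for start in range(0, len(output_lengths), batch_size):
        batch = output_lengths[start:start + batch_size]
        # Every slot in the batch is held until the longest request finishes.
        steps += max(batch)
    return steps


def continuous_batching_steps(output_lengths, batch_size):
    """Decode steps when freed slots are refilled before every iteration."""
    waiting = deque(output_lengths)  # requests not yet admitted (FIFO)
    active = []                      # tokens still to generate, per in-flight request
    steps = 0
    while waiting or active:
        # Admit waiting requests into any free slots.
        while waiting and len(active) < batch_size:
            active.append(waiting.popleft())
        # One decode iteration: each active request emits one token.
        active = [remaining - 1 for remaining in active]
        steps += 1
        # Requests that just finished give up their slots immediately.
        active = [remaining for remaining in active if remaining > 0]
    return steps


# Hypothetical workload: short and long generations interleaved, two slots.
lengths = [10, 100, 10, 100, 10, 100, 10, 100]
print("static:    ", static_batching_steps(lengths, batch_size=2))
print("continuous:", continuous_batching_steps(lengths, batch_size=2))
```

On this made-up workload the static schedule takes 400 decode steps and the continuous one takes 230, close to the 220-step floor implied by 440 total tokens across two slots. The gap comes entirely from short requests no longer holding a slot while a long neighbor finishes; real gains depend on the length distribution and on the costs this sketch leaves out.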