Speed tricks for LLM throughput

Practitioners report cheap throughput wins from running multiple parallel agents and tuning precision and batching. One thread claimed that eight parallel agents raised tokens per second roughly sixfold on Gemma 4. (x.com) Another post credits FP8, continuous batching, and attention paging for roughly 60% cost reductions in training and inference pipelines. (x.com)

A language model’s speed is usually capped less by math than by traffic jams in memory and scheduling. New practitioner reports say simple changes (more parallel requests, lower-precision math, and smarter batching) can lift token throughput without changing the model itself. (docs.vllm.ai) (x.com)

One recent post about Google DeepMind’s Gemma 4 said running eight parallel agents raised output to about six times the tokens per second of a single stream on the same setup. Google released Gemma 4 on April 2, 2026, with models ranging from Effective 2B to 31B and support for long context windows up to 256,000 tokens. (x.com) (ai.google.dev) (blog.google)

A second post said a stack of engineering changes, namely floating point 8 (FP8), continuous batching, and attention paging, cut costs by about 60% in training and inference pipelines. vLLM, one of the main open-source serving engines in this area, explicitly lists continuous batching, PagedAttention, and floating point 8 key-value cache support among its throughput features. (x.com) (docs.vllm.ai) (github.com)

The basic problem is that requests to a large language model do not all finish at the same pace, because some need many more output tokens than others. If a server waits for the slowest request in a batch to finish, graphics processors sit idle in the slots where faster requests are already done. (usenix.org) (openreview.net)

Continuous batching addresses that by swapping finished requests out and new requests in between decoding steps instead of waiting for the whole batch to end. The Orca paper, presented at the 16th USENIX Symposium on Operating Systems Design and Implementation in 2022, reported up to 36.9 times higher throughput than NVIDIA FasterTransformer at the same latency level on GPT-3 175B. (usenix.org 1) (usenix.org 2)

Attention paging tackles a different bottleneck: the memory used to store each token’s running context, called the key-value cache.
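To put rough numbers on that cache, the sketch below estimates its size per token and how many tokens fit in a fixed memory budget at 16-bit versus 8-bit precision. The model shape and the 16 GiB budget are made-up illustration values, not Gemma 4's real configuration, and the formula assumes a standard transformer that keeps one key vector and one value vector per layer and key-value head.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value):
    """Bytes of key-value cache one token occupies across all layers."""
    # Factor of 2: each layer stores one key and one value vector per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def max_cached_tokens(memory_budget_bytes, bytes_per_token):
    """How many tokens of cache fit in the given memory budget."""
    return memory_budget_bytes // bytes_per_token


# Hypothetical model shape and budget, chosen only for illustration.
NUM_LAYERS = 32
NUM_KV_HEADS = 8
HEAD_DIM = 128
BUDGET_BYTES = 16 * 1024**3  # 16 GiB set aside for the cache

bf16_per_token = kv_cache_bytes_per_token(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, 2)
fp8_per_token = kv_cache_bytes_per_token(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, 1)

print("16-bit cache per token (bytes):", bf16_per_token)
print("8-bit cache per token (bytes): ", fp8_per_token)
print("tokens that fit at 16-bit:", max_cached_tokens(BUDGET_BYTES, bf16_per_token))
print("tokens that fit at 8-bit: ", max_cached_tokens(BUDGET_BYTES, fp8_per_token))
```

In this simplified model, halving the bytes per value doubles the number of tokens that fit in the same memory, which is one reason a smaller cache can support larger batches.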
The vLLM paper said PagedAttention stores that cache in paged blocks, like virtual memory in an operating system, to cut waste and allow larger effective batches. (arxiv.org) (ar5iv.labs.arxiv.org)

Lower precision changes the size of the numbers the hardware moves around. vLLM supports floating point 8 quantization options, and FlashAttention-3 reported 1.5 to 2.0 times speedups on H100 graphics processors in bfloat16, with floating point 8 reaching 1.3 petaflops per second in its tests. (docs.vllm.ai) (openreview.net)

These gains are not free, and they do not always stack cleanly. Lower precision can hurt accuracy, larger batches can raise latency for individual users, and throughput numbers depend on prompt length, output length, hardware, and whether the server is full enough to keep the graphics processor busy. (openreview.net) (ai.google.dev)

That is why the recent Gemma 4 claims drew attention: they describe speedups from serving tactics rather than a new model architecture. With Gemma 4 now positioned by Google for “agentic workflows” and local deployment, the immediate race is not only to build better models, but to keep more of them busy at once. (blog.google) (docs.cloud.google.com)
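The continuous batching idea described earlier can be illustrated with a toy simulation. This is a simplified sketch, not how vLLM or Orca actually schedule requests: it counts decode steps only, ignores prompt processing and memory limits, and the function names and request lengths are invented for illustration.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        steps += max(batch)  # the whole batch waits for its slowest request
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a finished request's slot is refilled right away."""
    waiting = deque(lengths)
    running = []  # tokens still to generate for each in-flight request
    steps = 0
    while waiting or running:
        # Fill any free slots before the next decode step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step produces one token for every running request.
        steps += 1
        running = [left - 1 for left in running if left - 1 > 0]
    return steps


# Hypothetical workload: output lengths (in tokens) alternating short and long.
request_lengths = [2, 8, 2, 8, 2, 8, 2, 8]

print("static batching steps:    ", static_batching_steps(request_lengths, 2))
print("continuous batching steps:", continuous_batching_steps(request_lengths, 2))
```

When every request has the same length, both strategies take the same number of steps in this model. The gap appears only when output lengths vary, which matches the article's point that idle time comes from waiting on the slowest request.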
