LLM throughput tricks

A public experiment showed offloading the key‑value cache from GPU HBM produced roughly 3× throughput at 256 concurrent requests in that test. (x.com) Another technical thread ran bandwidth math suggesting certain Mixture‑of‑Experts setups could be as much as 7.8× faster under specific parallelism assumptions. (x.com)

Large language model serving is hitting memory limits before it hits math limits, and engineers are now moving more of the workload off the fastest memory to raise throughput. (developer.nvidia.com)

A large language model stores a running notebook of past tokens called the key-value cache so it does not recompute the whole prompt every step. NVIDIA said that cache for Llama 3 70B at a 128,000-token context uses about 40 gigabytes for one user and grows linearly as more users are added. (developer.nvidia.com) That growth turns graphics memory, not raw compute, into the immediate constraint in many deployments. NVIDIA said an H100-class setup can run out of on-package high-bandwidth memory once model weights and the cache are loaded together, especially with long contexts and larger batches. (developer.nvidia.com)

One fix is to treat graphics memory like a hot tier and system memory or storage like a colder tier. The vLLM project said in a January 8, 2026 post that key-value offloading can raise per-node throughput by reducing graphics-memory pressure and lowering preemption when requests do not share the same prefix. (vllm-project.github.io) That is the backdrop for the recent public benchmark claiming roughly three times throughput at 256 concurrent requests after offloading the key-value cache from high-bandwidth memory. The post was a public experiment, not a vendor standard, and the result depends on the model, context length, interconnect, and scheduler. (x.com)

A second idea comes from Mixture of Experts, a design that keeps many specialist sub-networks in the model but activates only a few for each token. Hugging Face said a router sends each token to selected experts, which cuts active computation compared with running every parameter on every token. (huggingface.co) That sparsity changes the bottleneck rather than removing it.
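
The routing step described for Mixture of Experts can be illustrated with a small sketch. It assumes a softmax router that keeps the two highest-scoring experts and renormalizes their weights, which is one common variant and not something the cited sources specify. The function name, the eight-expert setup, and the scores are hypothetical.

```python
import math

def route_token(router_logits, top_k=2):
    """Return (expert_index, weight) pairs for the top_k experts of one token."""
    # Softmax over all experts; subtracting the max keeps exp() stable.
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep only the top_k most probable experts for this token.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Renormalize so the kept weights sum to 1 (an assumed design choice).
    kept = sum(probs[i] for i in chosen)
    return [(i, probs[i] / kept) for i in chosen]

# Hypothetical router scores for a layer with 8 experts.
logits = [0.1, 2.0, -1.0, 0.5, 1.5, -0.3, 0.0, 0.2]
selection = route_token(logits, top_k=2)

# Fraction of experts that actually run for this token.
active_fraction = len(selection) / len(logits)
print(selection, active_fraction)
```

Only the selected experts run for the token, but all expert weights still have to live somewhere, which is why memory placement becomes the next concern.
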
A NeurIPS 2024 paper said Mixture of Experts inference is hard to deploy because of large model size and complex communication, then reported throughput gains from dynamic gating of 6.21 times to 11.55 times on language modeling workloads and lower memory use from buffering inactive experts in central processing unit memory. (proceedings.neurips.cc) The recent thread that estimated as much as 7.8 times faster performance used bandwidth math under specific parallelism assumptions, not a universal benchmark. That kind of estimate lines up with a broader shift in inference work: once only a fraction of experts fire, moving weights and cache efficiently can matter as much as the floating-point operations themselves. (x.com; proceedings.neurips.cc)

Hardware vendors are already building around that tradeoff. NVIDIA said Grace Hopper and Grace Blackwell connect central processing unit memory and graphics memory with a 900 gigabytes-per-second NVLink-C2C link, which it described as seven times the bandwidth of Peripheral Component Interconnect Express Gen 5 for memory sharing. (developer.nvidia.com)

The practical question is no longer only how many parameters a model has. It is how much of the model and its running state can stay close enough to the chip to keep 256 users, or more, moving without stalling on memory traffic. (developer.nvidia.com; x.com)
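
The cache-size and link-bandwidth figures quoted in this article can be checked with back-of-the-envelope arithmetic. The sketch below makes several assumptions not stated in the article: the Llama 3 70B shape comes from the published model configuration (80 layers, 8 key-value heads, head dimension 128), the cache is stored as 16-bit values, and the PCIe Gen 5 figure is the roughly 128 gigabytes-per-second bidirectional peak of an x16 link. Real transfers run below peak rates.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bytes_per_value=2):
    """Approximate key-value cache size in bytes for one sequence."""
    # The leading 2 covers the separate key and value tensors in each layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens

# Assumed Llama 3 70B shape with a 16-bit cache (not given in the article).
per_user = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, tokens=128_000)
per_user_gb = per_user / 1e9  # about 41.9 GB, in line with "about 40 gigabytes"

NVLINK_C2C_GBPS = 900.0  # figure quoted from NVIDIA in the article
PCIE_GEN5_GBPS = 128.0   # assumption: x16 link, bidirectional peak

# Idealized time to move one user's full cache between CPU and GPU memory.
seconds_nvlink = per_user_gb / NVLINK_C2C_GBPS
seconds_pcie = per_user_gb / PCIE_GEN5_GBPS

print(f"cache per user: {per_user_gb:.1f} GB")
print(f"NVLink-C2C: {seconds_nvlink:.3f} s, PCIe Gen 5: {seconds_pcie:.3f} s")
```

Under these assumptions the ratio between the two transfer times is about seven, which matches the vendor's description, and the per-user figure scales linearly with both context length and the number of concurrent users.
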
