Software tricks to cut inference costs

Engineers are focusing on software-side optimizations—quantization, pruning, mixture-of-experts, KV-cache compression and kernel-level tricks—to boost tokens-per-dollar without changing hardware. Elliot Arledge summarizes these approaches as the practical lever set for squeezing more throughput from existing nodes under higher concurrency. (x.com)

Running a language model is mostly two jobs: moving numbers out of memory and multiplying them fast enough to keep the graphics chip busy. On many real workloads, memory traffic is the bottleneck, so engineers are chasing software changes that squeeze more answers out of the same machine. (nvidia.github.io)

Quantization is the simplest trick. It stores model weights in smaller number formats, such as 8-bit integers or 4-bit floating point, instead of wider formats such as 16-bit floating point. That cuts memory use and often speeds inference, because less data has to be read from memory for every token. (github.com)

Pruning is the next step. It removes weights, channels, or attention heads that contribute little to the final answer, so the model does fewer operations, although aggressive pruning can hurt quality if the cuts are made in the wrong places. (openreview.net)

Mixture of experts changes the model layout itself. Instead of waking up every part of the network for every token, a routing layer sends each token to only a small subset of specialist blocks, which can preserve quality while cutting the amount of active computation per token. (arxiv.org)

That design creates a new problem: traffic jams. If too many tokens get routed to the same expert, that expert becomes a hotspot while other parts of the machine sit idle, so current work focuses on load balancing, expert buffering, and routing policies that keep throughput high in production. (openreview.net)

The key-value cache is another giant target. This cache stores the model’s running memory of previous tokens so it does not have to reprocess the whole prompt at every step, but it grows with every generated token and can become one of the biggest consumers of graphics memory. (nvidia.github.io)

Compressing that cache can free a surprising amount of room.
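To put a rough number on that room, the sketch below estimates cache size from the usual bookkeeping: one key vector and one value vector per layer, per attention head, per token. The model shape and workload are hypothetical values picked for illustration, not figures from any cited source, and the estimate ignores the small extra storage that quantized formats need for scale factors.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_value):
    """Estimate key-value cache size in bytes for a batch of sequences."""
    # Each token stores one key and one value vector per layer and per KV head.
    values_per_token = 2 * num_layers * num_kv_heads * head_dim
    return int(values_per_token * seq_len * batch_size * bytes_per_value)


# Hypothetical model shape and workload (assumptions for illustration only).
SHAPE = dict(num_layers=32, num_kv_heads=8, head_dim=128)
WORKLOAD = dict(seq_len=8192, batch_size=16)

GIB = 1024 ** 3
for label, width in [("16-bit", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)]:
    size = kv_cache_bytes(**SHAPE, **WORKLOAD, bytes_per_value=width)
    print(f"{label:>6} cache: {size / GIB:.1f} GiB")
```

Under these assumptions, each halving of the stored width halves the cache, so the same memory budget could hold roughly twice as many tokens or sequences.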
Recent systems quantize the cache, offload parts of it, or reuse cached blocks across requests, and NVIDIA said one priority-based reuse feature improved cache hit rates by about 20% on repeated-context workloads. (developer.nvidia.com)

Memory management matters as much as compression. The PagedAttention method from the vLLM team manages cache memory much as an operating system manages paged virtual memory, allocating it in small fixed-size blocks. That reduces fragmentation and wasted space, so servers can hold larger batches at once. (arxiv.org)

Bigger batches only help if the server can keep accepting work without waiting for the current batch to drain. vLLM pairs PagedAttention with continuous batching, which lets new requests join between decoding steps instead of waiting for the whole batch to finish. (docs.vllm.ai)

Then there are kernel tricks. Kernels are the small, hand-tuned routines that actually run on the graphics chip. Projects like vLLM and TensorRT-LLM fuse steps together, use optimized attention kernels, and capture repeated sequences of operations as execution graphs, so the chip spends less time on overhead and more time producing tokens. (docs.vllm.ai)

Put together, these tricks all chase one number: tokens per dollar. If a team can quantize weights, shrink the key-value cache, avoid memory waste, and keep batches full, the same rack of machines can serve more users before anyone buys another graphics processor. (docs.nvidia.com)
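As a minimal sketch of how that number is computed, tokens per dollar is just throughput divided by what the machine costs to run. The throughput and hourly price below are made-up placeholders, not measurements from any of the systems mentioned.

```python
def tokens_per_dollar(tokens_per_second, node_cost_per_hour):
    """Tokens generated for each dollar of machine time."""
    return tokens_per_second * 3600 / node_cost_per_hour


def cost_per_million_tokens(tokens_per_second, node_cost_per_hour):
    """Dollar cost of generating one million tokens."""
    return 1_000_000 / tokens_per_dollar(tokens_per_second, node_cost_per_hour)


# Hypothetical figures: same machine and price, higher throughput after tuning.
NODE_COST_PER_HOUR = 10.0
for label, throughput in [("baseline", 2000), ("optimized", 3000)]:
    tpd = tokens_per_dollar(throughput, NODE_COST_PER_HOUR)
    cpm = cost_per_million_tokens(throughput, NODE_COST_PER_HOUR)
    print(f"{label:>9}: {tpd:,.0f} tokens per dollar, ${cpm:.2f} per million tokens")
```

Because the hourly cost is fixed, any software change that raises sustained throughput raises tokens per dollar by the same factor.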
