Inference speed gains
- Red Hat AI and Tesla reported LLM inference improvements using KServe, llm-d, and vLLM in production tests.
- They claimed roughly 3x output tokens per second and 2x faster time‑to‑first‑token for Llama 3.1 70B.
- Those optimizations matter for real‑time collaboration features where latency and token throughput drive usability.

Getting an answer from a large language model is getting faster in some production setups: Red Hat engineers and Tesla engineers said a Kubernetes stack built around KServe, llm-d, and vLLM delivered about 3x more output tokens per second and cut time-to-first-token in half for Llama 3.1 70B. (llm-d.ai)

The April 21, 2026 post said those numbers came from serving Meta’s Llama 3.1 70B on four AMD MI300X graphics processors, with prefix-cache-aware routing turned on. The authors were Yuan Tang and Robert Shaw of Red Hat and Scott Cabrinha and Sai Krishna of Tesla. (llm-d.ai)

Inference is the part after training, when a model reads a prompt and starts generating tokens, or chunks of text, back to a user. Time-to-first-token measures the wait before the first word appears, while output tokens per second measures how fast the rest of the answer streams. (developers.redhat.com)

The stack splits jobs across layers: KServe manages deployment, scaling, revisions, and routing on Kubernetes, while vLLM runs the model and llm-d adds distributed scheduling and cache-aware traffic handling. Red Hat described KServe as the control plane for model serving and said KServe’s LLMInferenceService was added in v0.16 for large language model workloads. (developers.redhat.com) (kserve.github.io)

The cache in question is the key-value cache, a saved record of earlier prompt computation that lets a model avoid redoing the same work. KServe’s March 13, 2026 release notes for v0.17 said the project added key-value-cache-aware routing, disaggregated prefill and decode, and distributed inference features built on the llm-d framework. (kserve.github.io)

That matters most for long prompts and repeated context, where sending a request to the “right” graphics processor can save work and lower delay. The llm-d and Tesla engineers said simple round-robin balancing wastes those cache hits, which is expensive when graphics processor time is the main cost. (llm-d.ai)

The companies framed the result as an operations story as much as a speed story. Their post said earlier deployments based on a Kubernetes StatefulSet and network file storage created bottlenecks, while local storage improved speed but tied pods to specific nodes and made failures harder to recover from. (llm-d.ai)

llm-d itself is newer than the other pieces. Red Hat announced the llm-d community in May 2025 as an open-source, Kubernetes-native distributed inference project, and the public repository now describes it as a production stack for serving models across accelerators and infrastructure providers. (developers.redhat.com) (github.com)

The claim is still a vendor-published benchmark from one deployment, not an independent, apples-to-apples industry test. But it points to where model serving work has moved in 2026: less on training bigger models, and more on shaving delay and squeezing more text out of the same hardware. (llm-d.ai) (kserve.github.io)
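
For readers who want to see why round-robin balancing wastes cache hits, here is a minimal Python sketch of the idea behind prefix-cache-aware routing. It is a toy simulation, not llm-d's or KServe's actual scheduler: the `Replica` class, the `route_cache_aware` function, the four-token block size, and the sample prompts are all hypothetical names and values chosen for illustration. Each prompt is split into blocks, each block is hashed together with everything before it, and the router sends a request to the replica that already holds the longest matching run of blocks.

```python
import hashlib
from dataclasses import dataclass, field
from itertools import cycle

BLOCK_SIZE = 4  # tokens per cache block; illustrative value, not a real default


def block_hashes(tokens):
    """Return one hash per full block, each chained with all earlier tokens."""
    hashes = []
    running = hashlib.sha256()
    full = len(tokens) - len(tokens) % BLOCK_SIZE  # ignore a trailing partial block
    for start in range(0, full, BLOCK_SIZE):
        for tok in tokens[start:start + BLOCK_SIZE]:
            running.update(tok.encode("utf-8") + b"\x00")
        hashes.append(running.copy().hexdigest())
    return hashes


@dataclass
class Replica:
    """Hypothetical stand-in for one model server and its key-value cache."""
    name: str
    cached: set = field(default_factory=set)
    prefill_blocks: int = 0  # blocks this replica had to compute from scratch

    def matched_prefix(self, hashes):
        """Count how many leading blocks are already cached here."""
        n = 0
        for h in hashes:
            if h not in self.cached:
                break
            n += 1
        return n

    def serve(self, hashes):
        """Compute only the uncached blocks, then remember all of them."""
        self.prefill_blocks += len(hashes) - self.matched_prefix(hashes)
        self.cached.update(hashes)


def route_cache_aware(replicas, hashes):
    """Longest cached prefix wins; ties go to the replica with less prefill work."""
    return max(replicas, key=lambda r: (r.matched_prefix(hashes), -r.prefill_blocks))


def simulate(prompts, policy):
    """Return total blocks computed from scratch across four replicas."""
    replicas = [Replica(f"gpu-{i}") for i in range(4)]
    round_robin = cycle(replicas)
    for prompt in prompts:
        hashes = block_hashes(prompt.split())
        if policy == "cache_aware":
            target = route_cache_aware(replicas, hashes)
        else:
            target = next(round_robin)
        target.serve(hashes)
    return sum(r.prefill_blocks for r in replicas)


# Eight requests that share an eight-token system prompt (two full blocks).
SYSTEM = "you are a helpful assistant for our docs"
PROMPTS = [f"{SYSTEM} question number {i} please" for i in range(8)]

if __name__ == "__main__":
    print("round robin:", simulate(PROMPTS, "round_robin"))   # 16 blocks
    print("cache aware:", simulate(PROMPTS, "cache_aware"))   # 10 blocks
```

In this toy run, round-robin makes every replica recompute the shared system prompt once, while the cache-aware policy computes it a single time. The sketch deliberately ignores load: it sends every request to one replica, which a production scheduler would have to balance against queue depth and memory pressure.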