KServe + vLLM wins
- A Red Hat engineer described a production LLM stack using KServe + llm-d + vLLM, built in collaboration with Tesla.
- They reported about 3x more output tokens per second and roughly 2x faster time-to-first-token via prefix-cache-aware routing.
- PyTorch flagged the post as a concrete Kubernetes LLMOps pattern for storage, routing, and node-failure handling (x.com/i/status/2046774710671945754) (x.com/i/status/2047041147839959312).

Running large language models on Kubernetes still breaks on three old problems: moving huge model files, routing each prompt to the right graphics processor, and recovering when a node dies. A Red Hat and Tesla team said a stack built from KServe, llm-d, and vLLM handled all three in Tesla production and raised output-token throughput by about 3x. (llm-d.ai)

The team published the design on April 21, 2026, with authors from Red Hat and Tesla: Yuan Tang, Scott Cabrinha, Robert Shaw, and Sai Krishna. They said the setup also delivered roughly 2x faster time to first token by routing requests to pods that already held matching prompt prefixes in cache. (llm-d.ai)

KServe is the control plane here: it starts pods, scales them up from zero, manages revisions, and keeps the client connection open while tokens stream back. Red Hat’s April 21 explainer said KServe’s newer LLMInferenceService, added in KServe v0.16, is the piece aimed at generative AI workloads rather than standard prediction models. (developers.redhat.com)

vLLM is the model server that actually runs the model on accelerators, and its speed trick is a key-value cache that stores the work already done on a prompt so that repeated prefixes are reused instead of recomputed. The vLLM docs say Kubernetes deployments can use either KServe’s standard runtime or LLMInferenceService backed by llm-d. (docs.vllm.ai)

llm-d sits between those layers and adds the routing and scheduling logic that ordinary Kubernetes services do not have. KServe’s documentation says that layer brings key-value-cache-aware scheduling, prefill-decode separation, and multi-node orchestration for large language model inference. (kserve.github.io)

The storage piece was one of Tesla’s original pain points. The authors said network file systems were too slow for model weights that can reach hundreds of gigabytes, while local logical volume manager storage fixed speed but pinned pods to specific nodes and forced manual cleanup after hardware failures. (llm-d.ai)

Their account described a more Kubernetes-native tradeoff: keep the fast local storage, but let the serving layer recover without hand-edited persistent volume claims every time a node disappears. The same post said simple round-robin balancing wasted vLLM’s graphics-memory cache, because consecutive requests sharing a prefix landed on different pods, which is why the team moved to cache-aware routing. (llm-d.ai)

That makes this less about one benchmark than about an operating pattern. Red Hat’s companion article framed the KServe-plus-llm-d design as a way to combine lifecycle management, autoscaling, and governance with distributed inference patterns and lower infrastructure cost. (developers.redhat.com)

The open-source pieces are moving quickly around that pattern. KServe’s current docs describe LLMInferenceService in version 0.17, while the llm-d GitHub repository showed about 3,000 stars on April 22, 2026, a sign that the Kubernetes-native inference layer is drawing an audience beyond a single deployment write-up. (kserve.github.io) (github.com)

For teams trying to turn chat demos into production services, the practical claim in this case is narrow and concrete: cache-aware routing and failure-aware storage design can move latency and throughput at the same time. The write-up gave those gains a number, and KServe, llm-d, and vLLM gave them a repeatable Kubernetes shape. (llm-d.ai)
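
To make the cache-aware routing idea concrete, here is a minimal Python sketch of the general technique. It is not llm-d's scheduler and does not use any KServe, llm-d, or vLLM API. Every name in it (`Pod`, `route`, `prefix_block_hashes`, `BLOCK_SIZE`) is hypothetical, prompts are split on whitespace rather than run through a real tokenizer, and the load signal is a simple counter. The only point it illustrates is why a router that remembers which pod has seen which prompt prefix can beat round-robin.

```python
import hashlib
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # tokens per cache block; illustrative value, not a vLLM default


def prefix_block_hashes(tokens: list[str], block_size: int = BLOCK_SIZE) -> list[str]:
    """Hash each full block of tokens, chaining in the previous hash so that
    one block hash identifies the whole prefix up to and including that block."""
    hashes = []
    prev = ""
    for start in range(0, len(tokens) - block_size + 1, block_size):
        block = tokens[start:start + block_size]
        prev = hashlib.sha256((prev + "|" + " ".join(block)).encode()).hexdigest()
        hashes.append(prev)
    return hashes


@dataclass
class Pod:
    name: str
    cached: set[str] = field(default_factory=set)  # prefix-block hashes this pod is assumed to hold
    in_flight: int = 0  # toy load signal; never decremented in this sketch


def matched_blocks(pod: Pod, hashes: list[str]) -> int:
    """Count how many leading blocks of the prompt the pod already has cached."""
    count = 0
    for h in hashes:
        if h not in pod.cached:
            break
        count += 1
    return count


def route(pods: list[Pod], tokens: list[str]) -> Pod:
    """Pick the pod with the longest cached prefix; break ties by lowest load."""
    hashes = prefix_block_hashes(tokens)
    best = max(pods, key=lambda p: (matched_blocks(p, hashes), -p.in_flight))
    best.cached.update(hashes)  # assume the pod keeps this prefix after serving it
    best.in_flight += 1
    return best


# Two prompts that share an 8-token system prefix but end differently.
pods = [Pod("pod-a"), Pod("pod-b")]
system = "you are a helpful assistant that answers briefly".split()
first = route(pods, system + "what is a pod".split())
second = route(pods, system + "what is a node".split())
print(first.name, second.name)  # prints "pod-a pod-a"
```

Round-robin would have sent the second prompt to `pod-b`, which would then recompute the shared system prefix from scratch. The sketch sends it back to `pod-a` even though that pod is busier, which is the tradeoff a real scheduler has to balance against queue depth and cache eviction.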