vLLM runtime donated to CNCF

A Kubernetes-native inference project built on vLLM, called llm-d, was donated to the Cloud Native Computing Foundation to help run distributed LLM inference on production clusters. The contribution came at KubeCon EU and aims to make vLLM-style high-throughput serving easier to operate inside Kubernetes environments (x.com).

Most artificial intelligence apps do not fail because the model is bad. They fail because one slow request can pin a graphics processor for seconds, while the next request finishes almost instantly, and ordinary Kubernetes load balancing was built for web servers, not for that kind of traffic. (kubernetes.io, llm-d.ai, vllm.ai)

vLLM became popular because it squeezes more work out of the same hardware. The project describes itself as a high-throughput, memory-efficient serving engine for large language models, with continuous batching and a technique called PagedAttention to keep graphics processors busy instead of idle. (vllm.ai)

That still leaves an operations problem. Kubernetes can spread identical containers across machines and restart them when they crash, but its default service routing does not know which model replica already holds useful prompt state in memory. (kubernetes.io, llm-d.ai)

That prompt state is called a key-value cache. It works like keeping a half-finished crossword on the kitchen table instead of starting from a blank grid every time, so sending a follow-up request to the same replica can skip a chunk of repeated computation. (llm-d.ai)

llm-d was built to make that kind of serving work inside Kubernetes. The project says it adds inference-aware scheduling, key-value cache optimization, and distributed serving on top of vLLM so production clusters can route requests to the replica most likely to answer fastest. (llm-d.ai, cncf.io)

The project started in May 2025 with Red Hat, Google Cloud, IBM Research, CoreWeave, and NVIDIA. The founding group later expanded to include AMD, Cisco, Hugging Face, Intel, Lambda, Mistral AI, the University of California, Berkeley, and the University of Chicago. (cncf.io, llm-d.ai)

On March 24, 2026, at KubeCon plus CloudNativeCon Europe in Amsterdam, llm-d was accepted as a Cloud Native Computing Foundation Sandbox project. Sandbox is the foundation’s entry stage for new open-source projects that want neutral governance and a path into the wider cloud-native stack. (cncf.io, cncf.io)

The donation is really about turning model serving into infrastructure instead of custom glue code. The CNCF announcement says llm-d is meant to sit between higher-level control planes such as KServe and lower-level engines such as vLLM, so teams do not have to hand-build every routing and scaling trick themselves. (cncf.io)

The hardest part is that large language model traffic is not uniform. The llm-d team points to retrieval-augmented generation prompts with long inputs, reasoning jobs with long outputs, and multi-turn agent flows that benefit from hitting the same cache repeatedly, which is why round-robin balancing wastes time and money. (llm-d.ai)

llm-d’s answer is to make routing aware of inference state. The CNCF post says it acts as a primary implementation of the Kubernetes Gateway Application Programming Interface inference extension and uses an endpoint picker for programmable, prefix-cache-aware routing. (cncf.io)

So the news is not that vLLM changed owners. The news is that a Kubernetes-native layer built around vLLM just moved under Cloud Native Computing Foundation stewardship, which gives platform teams one more shared standard for running large language models on real production clusters instead of bespoke one-off stacks. (cncf.io, vllm.ai, llm-d.ai)
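To make the routing argument concrete, here is a toy Python sketch of why prefix-cache-aware routing can beat round-robin for multi-turn traffic. It is an illustration under loud assumptions, not llm-d's scheduler or the Gateway inference extension's interface: the names `Replica`, `make_round_robin`, `pick_prefix_aware`, and `simulate` are hypothetical, and shared leading characters stand in for the cached key-value state that a real engine tracks per token block.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Replica:
    """Hypothetical model replica that remembers the prompts it has served."""
    name: str
    cached_prompts: list[str] = field(default_factory=list)

    def cached_overlap(self, prompt: str) -> int:
        """Longest shared prefix (in characters) with any cached prompt."""
        best = 0
        for cached in self.cached_prompts:
            shared = 0
            for a, b in zip(cached, prompt):
                if a != b:
                    break
                shared += 1
            best = max(best, shared)
        return best


def make_round_robin():
    """Return a picker that cycles through replicas and ignores cache state."""
    turn = count()

    def pick(replicas, prompt):
        return replicas[next(turn) % len(replicas)]

    return pick


def pick_prefix_aware(replicas, prompt):
    """Prefer the replica with the most reusable prefix; break ties by lowest load."""
    return max(
        replicas,
        key=lambda r: (r.cached_overlap(prompt), -len(r.cached_prompts)),
    )


def simulate(prompts, replicas, pick):
    """Count how many prompt characters must be computed from scratch."""
    recomputed = 0
    for prompt in prompts:
        replica = pick(replicas, prompt)
        recomputed += len(prompt) - replica.cached_overlap(prompt)
        replica.cached_prompts.append(prompt)
    return recomputed


def two_conversations():
    """Two interleaved multi-turn chats; each turn resends the full history."""
    a1, b1 = "alpha:q1", "bravo:q1"
    a2, b2 = a1 + "|q2", b1 + "|q2"
    a3, b3 = a2 + "|q3", b2 + "|q3"
    return [a1, b1, a2, b2, a3, b3]


def fresh_replicas():
    return [Replica("r0"), Replica("r1"), Replica("r2")]


round_robin_cost = simulate(two_conversations(), fresh_replicas(), make_round_robin())
prefix_aware_cost = simulate(two_conversations(), fresh_replicas(), pick_prefix_aware)
print("round-robin recomputed characters:", round_robin_cost)
print("prefix-aware recomputed characters:", prefix_aware_cost)
```

In this toy workload the round-robin picker keeps landing follow-up turns on replicas that have never seen the conversation, so every prompt is computed in full, while the prefix-aware picker sends each follow-up back to the replica that already holds the earlier turns. A production scheduler would also weigh signals this sketch ignores, such as queue depth and memory pressure.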
