IBM/Red Hat/Google donate llm-d to CNCF

IBM Research, Red Hat and Google Cloud handed llm-d (built on vLLM) to the CNCF at KubeCon EU, positioning it as a production-grade distributed LLM inference option for Kubernetes clusters. The move targets teams that want scalable, Kubernetes-native inference rather than bespoke, one-off serving stacks. (x.com)

CNCF formally accepted llm-d into its Sandbox on March 24, 2026, marking the project's entry into the foundation's governance and lifecycle process. (cncf.io) llm-d was originally launched in May 2025 as a collaborative initiative led by Red Hat, with founding contributors including CoreWeave and NVIDIA and early participation from Google Cloud and IBM Research. (llm-d.ai)

The project's runtime architecture disaggregates inference by splitting prefill and decode across separate pods to prevent KV-cache fragmentation and GPU saturation under high-concurrency workloads. (pulumi.com) llm-d exposes a modular, Kubernetes-native data plane that sits between control planes (examples cited include KServe) and low-level inference engines, enabling orchestration of cache locality, routing, and instance placement. (cncf.io) The project implements an Endpoint Picker (EPP) as a primary implementation of the Kubernetes Gateway API Inference Extension (GAIE), making routing decisions based on real-time KV-cache hit rates, in-flight requests, and instance queue depth. (cncf.io)

Google Cloud says it integrated llm-d into a GKE Inference Gateway and validated it in Vertex AI, reporting time-to-first-token (TTFT) reductions of more than 35%, P95 tail latency improvements of 52%, and a prefix cache-hit rate increase from ~35% to ~70% in its internal tests. (cloud.google.com)

Project posts at KubeCon list NVIDIA, CoreWeave, AMD, Cisco, Hugging Face, Intel, Lambda, Mistral AI, UC Berkeley and the University of Chicago as contributors and ecosystem participants, in addition to IBM, Red Hat and Google Cloud. (cncf.io) Alongside llm-d, Google and partners announced at KubeCon that they are open-sourcing TPU and GPU drivers for the Dynamic Resource Allocation (DRA) API, aiming to accelerate DRA adoption and reduce vendor lock-in concerns for hardware acceleration on Kubernetes. (kube.fm)

As a CNCF Sandbox project, llm-d sits at the CNCF's entry maturity level, which covers experimental, early-stage projects. CNCF project criteria indicate that promotion to Incubation typically requires demonstrable stability, governance, and production adoption, often cited as production deployments across multiple independent organizations. (github.com)
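
To make the Endpoint Picker's routing signals concrete, here is a minimal Python sketch of load- and cache-aware endpoint selection. It is an illustration only, not llm-d's implementation or the GAIE API: the Endpoint type, the score and pick_endpoint functions, and the weights are all hypothetical. It assumes each pod reports the three signals named above (KV-cache hit rate, in-flight requests, queue depth) and that a simple weighted sum is enough to rank candidates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    """Hypothetical snapshot of one inference pod's routing signals."""
    name: str
    kv_cache_hit_rate: float  # 0.0..1.0, estimated prefix-cache hit rate
    in_flight: int            # requests currently being served
    queue_depth: int          # requests waiting in the pod's queue


def score(ep: Endpoint, w_cache: float = 1.0,
          w_inflight: float = 0.05, w_queue: float = 0.1) -> float:
    """Higher is better: reward cache locality, penalize load.

    The weights are illustrative assumptions, not values used by llm-d.
    """
    return (w_cache * ep.kv_cache_hit_rate
            - w_inflight * ep.in_flight
            - w_queue * ep.queue_depth)


def pick_endpoint(endpoints: list[Endpoint]) -> Endpoint:
    """Return the highest-scoring endpoint; raise if none are available."""
    if not endpoints:
        raise ValueError("no endpoints available")
    return max(endpoints, key=score)


if __name__ == "__main__":
    pods = [
        Endpoint("pod-a", kv_cache_hit_rate=0.70, in_flight=8, queue_depth=4),
        Endpoint("pod-b", kv_cache_hit_rate=0.35, in_flight=2, queue_depth=0),
        Endpoint("pod-c", kv_cache_hit_rate=0.70, in_flight=3, queue_depth=1),
    ]
    print(pick_endpoint(pods).name)
```

In this toy example pod-a and pod-c have the same cache hit rate, but pod-a's heavier load drags its score below even the cold-cache pod-b, so the picker chooses pod-c. A real picker would likely weigh these signals differently and refresh them continuously.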
