llm-d Joins CNCF Sandbox

llm-d, a project for AI inference on Kubernetes, has entered the CNCF Sandbox to help bridge the gap between prototype deployments and production inference at scale. The move fits a broader trend in which Kubernetes is treated as an "AI inference OS" that is widely used but still underutilized for day-to-day model deployments; llm-d aims to simplify model-aware routing and distributed serving. (x.com)

At KubeCon Europe on March 24, 2026, the project maintainers (IBM Research, Red Hat, and Google Cloud) formally contributed llm-d to the Cloud Native Computing Foundation's Sandbox. The announcement lists additional founding partners, including NVIDIA, CoreWeave, AMD, Cisco, Hugging Face, Intel, Lambda, and Mistral AI. (cncf.io) Project documentation and vendor posts say llm-d's purpose is to make production deployments of large language models more predictable and efficient by giving clusters standard, model-aware ways to route requests, share inference memory, and split model work across machines, so that teams can avoid bespoke schedulers and ad-hoc hacks. (cloud.google.com)

Technically, llm-d implements distributed inference: it splits inference work across multiple worker pools so that different machines specialize in different tasks. It supports "prefill/decode disaggregation," which separates the prompt-processing phase that builds a cache of intermediate activations (prefill) from the token-generation phase (decode), so each phase can be tuned independently for throughput or responsiveness. (llm-d.ai)

llm-d also adds cache-aware routing. The system tracks where key-value (KV) cache shards live (these are the stored activations used to speed up repeated or long-context generations) and routes requests to backends that already hold the relevant cache, avoiding expensive transfers. This logic is surfaced through an Inference Gateway that extends the Kubernetes Gateway API with inference-specific resources such as InferencePool. (gateway-api-inference-extension.sigs.k8s.io)

Google Cloud and early adopters published concrete performance examples from production validation: time-to-first-token (the time until the model emits its first generated token) improved by about 35% in one test, p95 tail latency dropped by roughly 52% in another, and prefix cache hit rates doubled in a Vertex AI deployment after integrating llm-d routing. (cloud.google.com)

The project is active upstream: the GitHub repository shows thousands of stars and frequent commit and PR activity. The installation guides include hardware and networking expectations; the examples were validated on eight-GPU H200 clusters with RDMA-capable networking, and the guides recommend RDMA for KV cache transfer. (github.com; llm-d.ai)
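
To make the cache-aware routing idea concrete, here is a minimal Python sketch. It is a toy illustration of the general technique, not llm-d's implementation or API: the `Backend` class, the `block_hashes`, `prefix_hits`, and `pick_backend` functions, the four-token block size, and the tie-break on active requests are all assumptions made for this example. Each prompt is cut into fixed-size token blocks, each block is hashed together with everything before it, and the router sends the request to the backend that already holds the longest matching run of blocks.

```python
import hashlib
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # tokens per cache block; an illustrative value, not an llm-d default


def block_hashes(tokens: list[str], block_size: int = BLOCK_SIZE) -> list[str]:
    """Hash each full block chained with the previous hash, so that one hash
    identifies the entire prompt prefix up to and including that block."""
    hashes: list[str] = []
    prev = ""
    for start in range(0, len(tokens) - block_size + 1, block_size):
        block = " ".join(tokens[start:start + block_size])
        prev = hashlib.sha256((prev + "|" + block).encode()).hexdigest()
        hashes.append(prev)
    return hashes


@dataclass
class Backend:
    """Hypothetical model server: remembers which prefix blocks it has cached."""
    name: str
    cached: set[str] = field(default_factory=set)
    active_requests: int = 0


def prefix_hits(backend: Backend, hashes: list[str]) -> int:
    """Count leading blocks already cached on this backend."""
    hits = 0
    for h in hashes:
        if h not in backend.cached:
            break
        hits += 1
    return hits


def pick_backend(backends: list[Backend], tokens: list[str]) -> Backend:
    """Prefer the backend with the most cached prefix blocks; on a tie,
    prefer the one with fewer active requests (assumed tie-break policy)."""
    hashes = block_hashes(tokens)
    best = max(backends, key=lambda b: (prefix_hits(b, hashes), -b.active_requests))
    best.cached.update(hashes)   # the chosen backend will now hold these blocks
    best.active_requests += 1
    return best


if __name__ == "__main__":
    pool = [Backend("a"), Backend("b")]
    system = "You are a helpful assistant that answers briefly".split()
    print(pick_backend(pool, system + "What is Kubernetes ?".split()).name)   # a
    print(pick_backend(pool, system + "Explain RDMA in short".split()).name)  # a
    print(pick_backend(pool, "Translate this sentence into French please now ok".split()).name)  # b
```

In this sketch the second request lands on backend `a` even though it is busier, because it shares the system prompt with the first request and can reuse those cached blocks; the unrelated third request falls back to the less loaded backend `b`. A production router would also need to weigh cache eviction and queue depth, which this example leaves out.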
