AI workload conformance gains urgency

Industry coverage says standardising AI workloads has become urgent, with efforts like llm‑d and CNCF conformance aimed at making model serving more portable and predictable across cloud‑native stacks. Greater conformance would reduce bespoke integration work and shift emphasis toward platform engineering and governed deployment patterns. (thenewstack.io)

Right now, running the same artificial intelligence model on two different Kubernetes setups can feel like moving a plug between outlets that look identical but are wired differently. The Cloud Native Computing Foundation has been pushing a new “Certified Kubernetes AI Conformance” program so vendors prove the basics work the same way for artificial intelligence workloads, not just for ordinary web apps. (cncf.io)

Kubernetes is the open-source system companies use to place software on clusters of machines, restart it when it fails, and scale it when traffic spikes. Standard Kubernetes conformance already checks that a platform can run normal containerized software, but Google says artificial intelligence workloads add extra demands like graphics processors, low-latency networking, and stateful data pipelines that the old tests did not cover. (opensource.googleblog.com)

A model server is the layer that takes a prompt, sends it to a model, and returns tokens one chunk at a time. That sounds simple until one cloud has different graphics processor drivers, another has different autoscaling behavior, and a third handles networking differently enough that the same setup breaks when moved. (thenewstack.io)

The new conformance effort is basically a shared checklist for that mess. The public repository says the goal is that if an artificial intelligence application works on one conformant Kubernetes platform, it should work on another with fewer “it works on my cluster” surprises. (github.com)

The checklist covers three kinds of work, and that detail matters. The Cloud Native Computing Foundation repository names training jobs, inference jobs, and agentic workloads, which means the standard is not only about building models but also about serving them live and running longer multi-step systems around them. (github.com)

One of the key pieces is finer control over accelerators, which are the specialized chips used for heavy model math.
Google says Dynamic Resource Allocation lets a job ask for hardware by attributes like memory size or special capabilities, instead of just asking for “one graphics processor” and hoping the cluster picks the right one. (opensource.googleblog.com)

Another piece is all-or-nothing scheduling for distributed training, which is like holding a flight at the gate until the whole team can board together. Google says the program requires support for tools such as Kueue so a big training job does not start half its pods, strand the rest, and burn expensive graphics processor time while waiting. (opensource.googleblog.com)

That is the standards side. The implementation side is a project called llm-d, which the Cloud Native Computing Foundation accepted as a Sandbox project on March 24, 2026, after it launched in May 2025 with Red Hat, Google Cloud, IBM Research, CoreWeave, and NVIDIA around a simple goal: any model, any accelerator, any cloud. (cncf.io)

llm-d is trying to make large language model inference a first-class Kubernetes workload instead of a pile of custom glue code. Its own site says it is built for production inference on Kubernetes with vLLM, intelligent scheduling, and key-value cache optimization, the trick of reusing prior computation so repeated context does not get recomputed from scratch. (llm-d.ai)

Google has already wired parts of that into a product. In its March 2026 post, Google said its Google Kubernetes Engine Inference Gateway uses the llm-d Endpoint Picker to route requests based on key-value cache hit rates, inflight requests, and queue depth, and said a Vertex AI production validation cut time-to-first-token latency by more than 35% for Qwen Coder workloads. (cloud.google.com)

The urgency comes from where the money and machines are going next.
The New Stack quoted Cloud Native Computing Foundation executive director Jonathan Bryce saying that by the end of 2026, about two-thirds of artificial intelligence compute will be for inference rather than training, which means the hard part is shifting from building giant models once to serving them reliably every second. (thenewstack.io)

If this works, platform teams stop hand-building one-off pathways for each model stack and start treating model serving more like a standard utility. The Cloud Native Computing Foundation says the program is meant to reduce fragmentation and give vendors and enterprises a common baseline, which is a dry way of saying fewer bespoke integrations and more time spent on governance, capacity planning, and deployment rules that carry across clouds. (cncf.io)
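
For readers who want to see the all-or-nothing idea from the distributed training discussion in concrete terms, here is a minimal Python sketch. It is not Kueue's implementation or API. The function name `admit_gang`, the per-node free-accelerator map, and the first-fit placement are all assumptions made for illustration. The only point it shows is that a job is admitted when every pod fits, and otherwise nothing starts.

```python
from __future__ import annotations


def admit_gang(free_gpus_per_node: dict[str, int], pods: int,
               gpus_per_pod: int) -> dict[str, int] | None:
    """Hypothetical all-or-nothing admission check (not Kueue's algorithm).

    Returns a placement {node_name: pod_count} only if every pod of the
    job fits on the cluster at once. Returns None if any pod would be
    left waiting, so no accelerator is claimed by a half-started job.
    """
    if pods <= 0 or gpus_per_pod <= 0:
        raise ValueError("pods and gpus_per_pod must be positive")

    placement: dict[str, int] = {}
    remaining = pods
    for node, free in free_gpus_per_node.items():
        # How many whole pods this node could host, capped by what is left.
        fit = min(remaining, free // gpus_per_pod)
        if fit > 0:
            placement[node] = fit
            remaining -= fit
        if remaining == 0:
            return placement  # the whole gang fits: admit it
    return None  # at least one pod would be stranded: admit nothing
```

With a hypothetical cluster of `{"n1": 4, "n2": 2}` free graphics processors, a job of three pods needing two each is admitted across both nodes, while a job of four such pods is refused outright instead of starting three pods and stranding the fourth.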

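The routing signals Google describes for the Endpoint Picker (key-value cache hit rate, inflight requests, and queue depth) can also be illustrated with a small sketch. This is not llm-d's or Google's actual scoring logic, which the coverage does not spell out. The `Endpoint` class, the `score` function, and the weights are hypothetical, chosen only to show how a cache-heavy replica can win over an idle one until its load gets too high.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    """One model-server replica with hypothetical load signals."""
    name: str
    cache_hit_rate: float  # share of the prompt already cached, 0.0 to 1.0
    inflight: int          # requests currently being processed
    queue_depth: int       # requests waiting to start


def score(ep: Endpoint, w_cache: float = 1.0, w_inflight: float = 0.05,
          w_queue: float = 0.1) -> float:
    """Higher is better. The weights are illustrative assumptions."""
    return (w_cache * ep.cache_hit_rate
            - w_inflight * ep.inflight
            - w_queue * ep.queue_depth)


def pick_endpoint(endpoints: list[Endpoint]) -> Endpoint:
    """Return the replica with the best combined score."""
    if not endpoints:
        raise ValueError("no endpoints available")
    return max(endpoints, key=score)
```

Under these assumed weights, a replica with a 90% cache hit rate and moderate load beats a nearly idle replica with a cold cache, but the same cache advantage loses once the replica has a long queue. A real router would tune or learn that trade-off.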
