GKE Inference Gateway
Google Cloud introduced GKE Inference Gateway to run real-time and asynchronous AI inference on the same infrastructure (x.com). Google published a blog post alongside the announcement describing how modern AI service patterns can use the gateway (x.com).
Google Cloud has introduced GKE Inference Gateway, a routing layer for Google Kubernetes Engine that lets companies run real-time and asynchronous artificial intelligence inference on the same pool of graphics processing units and tensor processing units. (cloud.google.com)
Inference is the step where a trained model answers a prompt, and Google says modern services now split that work between two patterns: real-time requests such as chat, and asynchronous jobs such as document indexing or product categorization. The new gateway is meant to schedule both on shared infrastructure instead of separate clusters. (cloud.google.com)
Google’s documentation describes GKE Inference Gateway as an extension to the GKE Gateway that adds routing and load balancing tuned for generative artificial intelligence workloads. It uses signals from model servers including key-value cache hits, graphics processing unit or tensor processing unit utilization, and request queue length. (docs.cloud.google.com)
Those signals matter because large language model requests are not like ordinary web traffic. The Kubernetes project said on June 5, 2025 that inference sessions are often long-running, resource-intensive, and partly stateful, which makes round-robin or path-based load balancing a poor fit. (kubernetes.io)
Google says the gateway can route requests with shared prompt prefixes back to the same accelerator so the model can reuse cached work instead of recomputing it. In its September 10, 2025 general-availability post, Google said that prefix-aware load balancing improved time-to-first-token latency by up to 96% at peak throughput for prefix-heavy workloads. (cloud.google.com)
The product also supports dynamic Low-Rank Adaptation, or LoRA, serving, which lets multiple fine-tuned variants share one base model and one accelerator. Google says that setup can reduce the number of graphics processing units and tensor processing units needed by packing several adapters onto common hardware. (docs.cloud.google.com)
Under the hood, the design follows the Gateway Application Programming Interface model that Kubernetes has been pushing as a newer networking standard. The Kubernetes project’s inference extension adds custom resources called InferencePool and InferenceModel so platform teams can manage where models run while application teams manage which model endpoint is exposed. (kubernetes.io)
Google’s deployment guide shows the feature is aimed at production operators, not just experiments. The setup requires Google Kubernetes Engine, Compute Engine, and Network Services application programming interfaces, and Google’s tutorial example calls for quota for Nvidia H100 graphics processing units plus access approval for Meta’s Llama 3.1 model on Hugging Face. (docs.cloud.google.com)
Google has been building outward from the initial gateway launch. On March 17, 2026, the company previewed a multi-cluster version that can route inference traffic across regions and clusters, with policies that use real-time metrics such as key-value cache utilization to pick backends. (cloud.google.com)
The immediate pitch is simpler: keep chat requests fast, keep batch jobs moving, and stop leaving expensive accelerators stranded in separate silos. Google is betting that inference infrastructure will look less like a web load balancer and more like a traffic controller for shared model compute. (cloud.google.com)
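For readers who want a concrete picture of the routing signals described above (cached prompt prefixes, accelerator utilization, and request queue length), the toy Python scorer below shows how such signals could be combined to pick a backend. It is a minimal sketch and not Google's algorithm: the `Replica` class, the `score` and `pick_replica` functions, the sample prompt, and the weights are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical snapshot of one model-server replica (not a GKE type)."""
    name: str
    queue_length: int          # requests currently waiting on this replica
    accelerator_util: float    # 0.0 (idle) to 1.0 (saturated)
    cached_prefixes: set = field(default_factory=set)  # prompt prefixes held in cache


def longest_cached_prefix(prompt: str, cached: set) -> int:
    """Length of the longest cached prefix that the prompt starts with."""
    return max((len(p) for p in cached if prompt.startswith(p)), default=0)


def score(replica: Replica, prompt: str,
          w_prefix: float = 1.0, w_queue: float = 0.1, w_util: float = 0.5) -> float:
    """Higher is better. The weights are arbitrary, illustrative assumptions."""
    reuse = longest_cached_prefix(prompt, replica.cached_prefixes) / max(len(prompt), 1)
    return (w_prefix * reuse
            - w_queue * replica.queue_length
            - w_util * replica.accelerator_util)


def pick_replica(replicas: list, prompt: str) -> Replica:
    """Return the replica with the best combined score for this prompt."""
    if not replicas:
        raise ValueError("no replicas available")
    return max(replicas, key=lambda r: score(r, prompt))


# Usage: a busier replica that already holds the shared system prompt
# versus an idle replica with a cold cache.
SYSTEM = "You are a support assistant for ExampleCo. "
replicas = [
    Replica("a", queue_length=2, accelerator_util=0.6, cached_prefixes={SYSTEM}),
    Replica("b", queue_length=0, accelerator_util=0.2),
]
print(pick_replica(replicas, SYSTEM + "Where is my order?").name)  # prints "a"
print(pick_replica(replicas, "Hello").name)                        # prints "b"
```

In this sketch a prompt that shares a cached prefix is steered to the replica that can reuse that work, while an unrelated prompt goes to the less loaded replica. A round-robin balancer would ignore both signals, which is the mismatch the Kubernetes project describes.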