Google's chip push

- Google has been selling its in-house AI chips to outside developers as demand rises for faster model inference.
- Reports say the chips are being positioned as a competitive alternative to Nvidia accelerators.
- The trend makes hardware-aware software skills (throughput, batching and latency trade-offs) more relevant for production backends (latimes.com).

Google is pushing its in-house artificial intelligence chips beyond its own products and into the wider cloud market as companies hunt for faster, cheaper ways to run AI models. (latimes.com)

The chips are called Tensor Processing Units, or TPUs: custom processors Google built for neural-network math, the repeated matrix calculations behind tools like chatbots, image generators and recommendation systems. Google Cloud says TPUs are designed for both training models and inference, the stage when a trained model answers a prompt. (cloud.google.com)

Google made its sixth-generation Trillium TPU generally available to Google Cloud customers on December 11, 2024. In that launch, Google said Trillium delivers up to 3 times higher inference throughput than TPU v5e and 67% better energy efficiency. (cloud.google.com) On April 9, 2025, Google introduced Ironwood, its seventh-generation TPU and the first one it said was designed specifically for inference. Google later said Ironwood was available to cloud customers by November 25, 2025. (blog.google, blog.google)

Inference is where the AI business has shifted: training is the expensive process of building a model, while inference is the nonstop work of serving answers to users after the model is built. Bloomberg reported on April 20 that Google is likely to add new chips dedicated to inference as adoption of AI software surges. (bloomberg.com, cloud.google.com)

Google is also pairing the hardware push with software tuned to squeeze more work out of each chip. In a May 9, 2025 post, Google said its JetStream inference engine and vLLM support for TPU were built to improve low-latency, high-throughput serving on large language models. (cloud.google.com) Those terms shape how AI apps feel in practice: throughput is how many tokens or requests a system can process, latency is how long one user waits, and batching is the trick of grouping requests together to keep chips busy.
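
The trade-off among those three terms can be sketched with a toy cost model. The numbers below are hypothetical, not measurements from any TPU or GPU: the sketch assumes every batch pays a fixed overhead plus a small per-request cost, and that all requests in a batch finish together. The function name `batch_stats` and its parameters are illustrative only.

```python
def batch_stats(batch_size, fixed_ms=20.0, per_request_ms=2.0):
    """Return (throughput in requests/s, latency in ms) for one batch.

    Hypothetical cost model: each batch pays a fixed overhead plus a
    per-request cost, and every request waits for the whole batch.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch_ms = fixed_ms + per_request_ms * batch_size
    throughput = batch_size / (batch_ms / 1000.0)
    return throughput, batch_ms


# Compare a few batch sizes under the assumed costs.
for size in (1, 8, 32):
    rps, latency = batch_stats(size)
    print(f"batch={size:>2}  throughput={rps:6.1f} req/s  latency={latency:5.1f} ms")
```

Under these assumed costs, larger batches raise throughput because the fixed overhead is shared across more requests, but each user waits longer for an answer. Real serving systems tune batch size against a latency target rather than maximizing either number alone.
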
Google said Trillium running JetStream reached 2.9 times the throughput of TPU v5e on Llama 2 70B benchmarks. (cloud.google.com)

Google is selling these chips into a market Nvidia still leads. Google Cloud’s own TPU page now pitches TPUs alongside graphics processing units, or GPUs, and Bloomberg reported that Google’s recent momentum included deals with Meta and Anthropic. (cloud.google.com, bloomberg.com)

Anthropic has already expanded its TPU relationship with Google. In a November 2025 post, Anthropic said the deal was worth tens of billions of dollars and would bring well over a gigawatt of capacity online in 2026. (anthropic.com)

Google’s pitch is no longer just “rent our cloud.” It is “run your model on our chips, with our networking and serving software,” a stack Google says already uses more than 100,000 Trillium chips on its Jupiter network fabric. (cloud.google.com)
