Chips for Inference
- Google unveiled the Ironwood TPU and previewed an eighth-generation split between training and inference chips.
- Nvidia and Google announced Vera Rubin-powered A5X instances that can scale toward nearly one million Rubin GPUs.
- The roadmap aims to cut inference costs and make routine, large-scale enterprise AI deployment economically viable.

Artificial intelligence chips do two costly jobs: training a model and then running it for users. Google is now splitting those jobs into separate chips as it chases cheaper, larger-scale AI deployment. (blog.google)

At Google Cloud Next in April 2025, Google introduced Ironwood, its seventh-generation Tensor Processing Unit, as its first TPU designed specifically for inference, the stage when a trained model generates answers, images, or actions. Google said Ironwood delivers 4,614 teraFLOPS per chip, 192 gigabytes of high-bandwidth memory, and scales to 9,216 chips with 42.5 exaFLOPS of compute. (blog.google)

On April 22, 2026, Google previewed an eighth-generation split: TPU 8t for large-model training and TPU 8i for inference and reinforcement learning. Google said both chips are built for what it calls the “agentic era,” when models make repeated calls to tools and other models instead of answering in a single pass. (blog.google)

That distinction tracks how AI systems spend money. Training is the one-time process of teaching a model from huge datasets, while inference is the repeated, day-to-day work of serving every prompt, search result, code suggestion, or software agent action. (cloud.google.com)

Google has spent more than a decade building Tensor Processing Units for its own services and for Google Cloud customers, but earlier generations were aimed at both training and serving. Ironwood and the new TPU 8 roadmap move closer to a division of labor that mirrors how cloud customers actually buy compute: one pool for building models, another for running them at scale. (blog.google; blog.google)

Google is not betting only on its own silicon. Nvidia and Google Cloud said on April 22, 2026, that they will offer Google Cloud A5X instances powered by Nvidia Vera Rubin systems, with designs that can scale toward nearly 1 million Rubin graphics processing units, or GPUs. (blogs.nvidia.com)

The same Nvidia announcement paired those A5X systems with Google’s Gemini models on Google Distributed Cloud, confidential computing on Nvidia Blackwell GPUs, and enterprise agent tools built with Gemini Enterprise Agent Platform, Nvidia Nemotron, and Nvidia NeMo. The pitch was not a single chip, but a stack of hardware, software, and security features for companies building what Nvidia calls AI factories. (blogs.nvidia.com)

Google made a similar point in its AI Hypercomputer materials, which describe cloud AI as an integrated system rather than a standalone accelerator. In that model, chip choice matters because the economics of serving millions of prompts depend on networking, memory, software, and how efficiently work is scheduled across clusters. (cloud.google.com)

The competitive backdrop is straightforward: Google is expanding custom TPUs for customers that want tightly integrated infrastructure, while also deepening its partnership with Nvidia for buyers that want the latest GPU roadmaps. That gives Google Cloud two answers to the same enterprise question of 2026: how to run more AI, more often, without letting inference bills swamp the business case. (blog.google; blogs.nvidia.com)

The next test is whether those specialized chips actually turn AI from a premium project into routine infrastructure. Google’s roadmap says the future market is not just about training the biggest models, but about serving them cheaply enough that companies keep them on all the time. (blog.google; blog.google)
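
As a quick check on the Ironwood figures quoted earlier, the per-chip and full-scale numbers Google gave are consistent with each other. The short Python sketch below multiplies the per-chip values by the chip count. The constant and function names are illustrative, decimal units are assumed, and the aggregate memory total is derived here by simple multiplication rather than quoted in this article.

```python
# Figures as quoted in the article from Google's Ironwood announcement.
TFLOPS_PER_CHIP = 4_614   # teraFLOPS per Ironwood chip
CHIPS_AT_FULL_SCALE = 9_216
HBM_GB_PER_CHIP = 192     # gigabytes of high-bandwidth memory per chip

# Assumption: decimal prefixes (1 exa = 10^6 tera, 1 TB = 10^3 GB).
TERA_PER_EXA = 1_000_000
GB_PER_TB = 1_000


def total_exaflops(tflops_per_chip: int, chips: int) -> float:
    """Aggregate peak compute in exaFLOPS, assuming it scales linearly with chips."""
    return tflops_per_chip * chips / TERA_PER_EXA


def total_hbm_terabytes(gb_per_chip: int, chips: int) -> float:
    """Aggregate HBM in terabytes (derived figure, not quoted in the article)."""
    return gb_per_chip * chips / GB_PER_TB


pod_compute = total_exaflops(TFLOPS_PER_CHIP, CHIPS_AT_FULL_SCALE)
pod_memory = total_hbm_terabytes(HBM_GB_PER_CHIP, CHIPS_AT_FULL_SCALE)

print(f"Peak compute: {pod_compute:.1f} exaFLOPS")  # matches the quoted 42.5
print(f"Aggregate HBM: {pod_memory:.0f} TB")
```

The product, about 42.52 exaFLOPS, rounds to the 42.5 figure Google cited. These are peak numbers; sustained inference throughput also depends on the networking, memory, and scheduling factors described above.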