Google's TPU and scale signals

- Google announced new TPU hardware and framed inference as a distinct systems problem, splitting chips between training and inference.
- Sundar Pichai said Google models now process more than 16 billion tokens per minute, up from 10 billion last quarter.
- The messaging positions inference economics, routing, and governance as operational priorities for teams building production agent services (blog.google).

A token is a small chunk of text, and inference is the step where a model turns those chunks into an answer. Google said on April 22 that its models now process more than 16 billion tokens a minute. (blog.google)

Sundar Pichai gave that figure at Google Cloud Next 2026 as the company introduced two new eighth-generation Tensor Processing Units, or TPUs: TPU 8t for training models and TPU 8i for running them in production. Google said the 16 billion figure was up from 10 billion tokens per minute last quarter. (blog.google)

Training is the phase where a model learns patterns from large datasets; inference is the phase where a deployed model answers live prompts. Google said the infrastructure requirements for pre-training, post-training, and real-time serving have now diverged enough that it built separate systems for those jobs. (cloud.google.com)

Google said TPU 8t is designed for frontier-model training, while TPU 8i is tuned for large-scale inference and reinforcement learning. In a separate product post, the company said TPU 8i links 1,152 chips in one pod, adds three times more on-chip SRAM, and is built to cut latency for “millions of agents” running at once. (cloud.google.com) (blog.google)

That split extends a line Google started drawing a year earlier. At Cloud Next 2025, Google introduced Ironwood as its seventh-generation TPU and called it the first version designed specifically for inference. (blog.google)

Google’s cloud division has also been building software around that hardware distinction. Its TPU inference documentation says serving is the production step where a trained model is deployed for use, and that latency service-level objectives are a priority for serving on TPU v5e and newer systems. (docs.cloud.google.com)

The company tied the new chips to its broader “AI Hypercomputer” stack, which combines processors, networking, software, and data center design. Google said TPU 8t and TPU 8i are key parts of that stack and are hosted for the first time on Google’s Axion Arm-based processors. (cloud.google.com 1) (cloud.google.com 2)

Google has not yet made the new eighth-generation TPUs generally available. The company said customers can request more information now ahead of general availability later in 2026. (blog.google)

The immediate signal from Next was less about one benchmark than about operating at sustained volume. Google paired a new scale number, 16 billion tokens a minute, with a hardware roadmap that treats live model serving as its own engineering problem. (blog.google)
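
For a rough sense of scale, the two reported throughput figures can be converted into per-second and per-day rates and a quarter-over-quarter growth rate. The sketch below is illustrative arithmetic only. It assumes the figures are exact (Google said "more than 16 billion", so the current number is a lower bound), and the function names are hypothetical.

```python
# Reported figures, treated as exact values (an assumption: Google said
# "more than 16 billion" tokens per minute, so this is a lower bound).
TOKENS_PER_MIN_NOW = 16_000_000_000
TOKENS_PER_MIN_PREV = 10_000_000_000  # "last quarter", as reported


def per_second(tokens_per_minute: int) -> float:
    """Convert a per-minute token rate to a per-second rate."""
    return tokens_per_minute / 60


def per_day(tokens_per_minute: int) -> int:
    """Convert a per-minute token rate to a per-day total (1,440 minutes)."""
    return tokens_per_minute * 60 * 24


def growth_pct(now: int, prev: int) -> float:
    """Percentage change from the previous rate to the current rate."""
    return (now - prev) * 100 / prev


if __name__ == "__main__":
    print(f"Per second: {per_second(TOKENS_PER_MIN_NOW):,.0f} tokens")
    print(f"Per day:    {per_day(TOKENS_PER_MIN_NOW):,} tokens")
    print(f"Growth:     {growth_pct(TOKENS_PER_MIN_NOW, TOKENS_PER_MIN_PREV):.0f}%")
```

Under those assumptions, the reported rate works out to roughly 267 million tokens per second, about 23 trillion tokens per day, and a 60% increase over the prior quarter's figure.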
