Google's TPU Inference Push
- Google is accelerating its AI-chip push, positioning TPUs as alternatives to Nvidia GPUs for inference workloads.
- Industry reports say Google plans TPU 8t for training and TPU 8i optimised specifically for inference.
- That inference focus reshapes robotics economics toward efficient, low-latency edge compute and persistent orchestration stacks. (digitimes.com)
Google used its Cloud Next conference in Las Vegas on April 22 to unveil two separate eighth-generation Tensor Processing Units, or TPUs: TPU 8t for training AI models and TPU 8i for running them after training. (cloud.google.com)

A TPU is Google’s in-house AI chip, built for the math behind models. “Inference” is the stage when a trained model answers a prompt, labels an image, or decides what a robot should do next. (docs.cloud.google.com)

Google said the split reflects a change in AI workloads: pre-training, post-training, and real-time serving now have different bottlenecks. TPU 8t is aimed at frontier-model training, while TPU 8i is aimed at large-scale inference and reinforcement learning. (cloud.google.com)

That is a sharper break from Google’s recent lineup than it first appears. In April 2025, Google introduced Ironwood as its seventh-generation TPU and called it its first chip designed specifically for inference. (blog.google) Ironwood is still the latest TPU covered in Google Cloud’s general documentation, which lists TPU7x as the newest Cloud TPU and describes a 9,216-chip pod for large-scale training and decode-heavy inference. Each chip has 192 GiB of high-bandwidth memory and peak FP8 compute of 4,614 teraflops. (docs.cloud.google.com)

Google’s new TPU 8t pushes the training side further upmarket. Google said it scales to 9,600 chips in a single superpod and is tuned for massive pre-training runs and embedding-heavy workloads. (cloud.google.com)

The company tied the new chips to “world models” and agent systems that simulate environments and carry out long chains of reasoning. Google said the eighth-generation TPU family is meant to cover “the full AI lifecycle,” from first-token training to multi-turn serving. (cloud.google.com)

That emphasis puts more weight on the serving side of AI economics, where latency targets and hardware utilization decide whether an application is affordable to run.
Google’s TPU inference documentation treats latency service-level objectives as a priority for serving and says inference is supported on TPU v5e and newer chips. (docs.cloud.google.com)

The split also puts Google more directly against Nvidia in the market for chips that run models, not just train them. Nvidia has spent the past year pushing Blackwell for inference efficiency in data centers and Thor modules for robots and industrial systems at the edge. (developer.nvidia.com) (edge-ai-vision.com)

For robotics, the distinction is practical. A robot needs fast, repeated inference close to the machine for perception and control, while larger training jobs can stay in centralized clusters. Nvidia says Jetson AGX Thor delivers up to 2,070 FP4 teraflops within a 130-watt envelope for those edge workloads. (edge-ai-vision.com)

Google has not announced a robot module of its own; its TPU roadmap is still framed around Google Cloud and its AI Hypercomputer stack. But by separating training silicon from inference silicon, Google is signaling that the biggest fight in AI infrastructure is shifting from building models to running them cheaply, quickly, and at scale. (cloud.google.com) (www.googlecloudevents.com)