Google TPU 8t/8i Split

Published by The Daily Scout

What happened

- Google announced eighth‑generation TPUs, splitting designs into TPU 8t for training and TPU 8i for inference.
- The new chips are explicitly pitched for the emerging “agentic era” of many small, interacting models.
- That split changes the compute decision calculus for startups building agents, who will now weigh latency, token cost, and tooling when choosing hardware. (blog.google) (techcrunch.com)

Why it matters

Google has split its next Tensor Processing Unit in two, with one chip for training models and another for running them live in products. (blog.google.com)

Google announced the eighth-generation designs, called TPU 8t and TPU 8i, on April 22 at Google Cloud Next ’26. The company said TPU 8i is built for “near-zero latency” inference, while TPU 8t is tuned for training and large memory pools. (cloud.google.com) (blog.google.com)

A Tensor Processing Unit is Google’s in-house artificial intelligence chip, the same basic category of hardware that trains models and serves answers after a user prompt. Google’s TPU developer hub says Cloud TPUs support the full cycle from pre-training to production serving with JAX, PyTorch, and vLLM. (cloud.google.com)

Training is the expensive phase where a model learns from huge datasets; inference is the cheaper, repeated phase where the trained model generates tokens for users. Google said those workloads have “diverged,” with pre-training, post-training, and real-time serving now hitting different bottlenecks. (cloud.google.com)

Google is pitching that split around “agentic” software, where many smaller models plan, call tools, and hand work to one another instead of one model answering once. In its launch post, Google said those agents need fast response times and multi-step execution, and that TPU 8i is designed to complete those chains quickly enough for a usable product. (blog.google.com)

The technical tradeoff is straightforward: a training chip is judged by how much model-building work it can push through, while an inference chip is judged by delay, throughput, and cost per request. Google said TPU 8t can scale to a 9,600-chip superpod for pre-training, and TechCrunch reported Google is claiming up to 3x faster training and 80% better performance per dollar than prior generations. (cloud.google.com) (techcrunch.com)

Google also tied the new family to its broader “AI Hypercomputer” stack, which bundles chips, networking, software, and data center design. The company said the eighth-generation TPU systems are the first to use its Axion Arm-based host processors to reduce delays from data preparation and orchestration. (cloud.google.com 1) (cloud.google.com 2)

The split is also a break from Google’s current TPU7x, known as Ironwood, which Google documents as a seventh-generation system for both large-scale training and inference. Ironwood runs in 9,216-chip pods and requires Google Kubernetes Engine, according to Google’s TPU7x documentation. (cloud.google.com)

For startups building agents, that means the hardware decision is becoming less about buying “the best chip” and more about matching a workload to the right part of the stack. Google’s own tooling now emphasizes separate paths for training and serving, including JAX and PyTorch for model development and vLLM for low-latency inference on TPUs. (cloud.google.com)

Google is not abandoning Nvidia while it pushes its own silicon. TechCrunch reported Google also plans to offer Nvidia’s Vera Rubin systems in its cloud later this year and is working with Nvidia on networking software, which leaves customers choosing between ecosystems rather than a single winner. (techcrunch.com)
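The inference tradeoff described above (delay, throughput, and cost per request) can be put into rough numbers for a multi-step agent chain. The sketch below is illustrative only: the `ServingOption` fields, the two example profiles, and every figure in them are hypothetical placeholders, not published TPU 8i or Nvidia specifications or prices. It assumes agent steps run one after another and that accelerator cost is shared evenly across concurrent requests.

```python
from dataclasses import dataclass


@dataclass
class ServingOption:
    """Hypothetical serving profile; every figure is a placeholder, not vendor data."""
    name: str
    time_to_first_token_s: float  # delay before the first output token
    tokens_per_second: float      # decode speed seen by a single request
    dollars_per_hour: float       # assumed hourly price of the accelerator
    concurrent_requests: int      # requests sharing the accelerator at once


def step_latency_s(opt: ServingOption, output_tokens: int) -> float:
    """Latency of one model call: startup delay plus time to generate the tokens."""
    return opt.time_to_first_token_s + output_tokens / opt.tokens_per_second


def chain_latency_s(opt: ServingOption, steps: int, output_tokens: int) -> float:
    """Agent steps run sequentially, so per-step latency adds up across the chain."""
    return steps * step_latency_s(opt, output_tokens)


def chain_cost_dollars(opt: ServingOption, steps: int, output_tokens: int) -> float:
    """Accelerator time used by the chain, split evenly across concurrent requests."""
    seconds = chain_latency_s(opt, steps, output_tokens)
    return opt.dollars_per_hour / 3600 * seconds / opt.concurrent_requests


# Two made-up profiles: one tuned for low delay, one for cheap bulk throughput.
low_latency = ServingOption("low-latency", 0.2, 100.0, 12.0, 8)
high_throughput = ServingOption("high-throughput", 1.0, 50.0, 6.0, 32)

if __name__ == "__main__":
    for opt in (low_latency, high_throughput):
        latency = chain_latency_s(opt, steps=5, output_tokens=100)
        cost = chain_cost_dollars(opt, steps=5, output_tokens=100)
        print(f"{opt.name}: {latency:.1f} s per chain, ${cost:.5f} per chain")
```

With these placeholder figures the low-latency profile finishes a five-step chain sooner but costs more per chain, which is the kind of comparison a startup would rerun with its own measured latencies and actual cloud prices.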

Key numbers

  • April 22: the date Google announced TPU 8t and TPU 8i at Google Cloud Next ’26.
  • 9,600 chips: the superpod size Google said TPU 8t can scale to for pre-training.
  • Up to 3x faster training and 80% better performance per dollar than prior generations, according to Google claims reported by TechCrunch.
  • 9,216 chips: the pod size of the current TPU7x, known as Ironwood, according to Google’s TPU7x documentation.

What happens next

  • Google has split its next Tensor Processing Unit in two, with one chip for training models and another for running them live in products. (blog.google.com)
  • Google announced the eighth-generation designs, called TPU 8t and TPU 8i, on April 22 at Google Cloud Next ’26.
  • Google is pitching that split around “agentic” software, where many smaller models plan, call tools, and hand work to one another instead of one model answering once.
