PyTorch on TPUs improved

Google published a PyTorch-native backend for TPUs that lets many PyTorch workloads run with minimal code changes, lowering the barrier to TPU adoption (x.com). Its engineering write-up says a mode called Fused Eager can deliver roughly 50 to 100 percent or more performance uplift in scalable training scenarios, which could change cost and performance tradeoffs for cloud ML infrastructure (x.com).

Training a model on a graphics processing unit usually feels like driving a car with instant steering: you write a line in PyTorch, and the chip does it right away. Training on a Tensor Processing Unit has often felt more like mailing a route to a dispatcher first, because the program had to be bundled up and compiled before the hardware ran it. (docs.pytorch.org)

That extra compile step came from PyTorch XLA, the bridge layer that has been the standard way to run PyTorch on Google’s Tensor Processing Units for years. Its docs say Tensor Processing Unit programs default to “lazy” execution, which means operations build a graph first and only run after an explicit sync point. (docs.pytorch.org, github.com)

Lazy execution helped Google scale training across very large Tensor Processing Unit systems, but it also made debugging and day-to-day coding feel less like normal PyTorch. In an October 20, 2025 design proposal, the PyTorch XLA team said that reliance on lazy tensors and calls like `xm.mark_step` created a developer experience that felt distinct from native PyTorch. (github.com)

Google’s new answer is called TorchTPU, and it was published on April 7, 2026 as a PyTorch-native backend for Tensor Processing Units. Google says the goal is that a developer can take an existing PyTorch script, switch initialization to `tpu`, and run the training loop without changing the core model logic. (developers.googleblog.com)

The key phrase in that post is “eager first.” That means the software tries to preserve the normal PyTorch habit of dispatching operations immediately, then uses the XLA compiler underneath when it finds chunks of work worth compiling for speed. (developers.googleblog.com, github.com) Google’s 2025 proposal described this as deferred execution: the system can compile a single operation, a short sequence, a fused cluster, or an entire forward and backward pass, and it can do that compilation asynchronously while other work keeps running. That is the software equivalent of a kitchen cooking one dish now while prepping the full banquet in the background. (github.com)

The hardware side explains why Google cares so much about this. Google says modern machine learning training now stretches across distributed systems with on the order of 100,000 accelerators, and its Tensor Processing Unit systems connect chips through an Inter-Chip Interconnect arranged in two-dimensional or three-dimensional torus networks to avoid ordinary network bottlenecks. (developers.googleblog.com) Inside each Tensor Processing Unit chip, Google says dense matrix math runs on TensorCores while irregular jobs like embeddings and gather-scatter work run on SparseCores. A backend that feels like ordinary PyTorch but still knows how to feed those specialized units efficiently is what Google is trying to build here. (developers.googleblog.com)

Google’s engineering write-up says a mode it calls Fused Eager can deliver roughly 50 to 100 percent or more performance uplift in scalable training scenarios. If those gains hold outside Google’s internal tests, the practical effect is that teams already writing models in PyTorch may no longer need a major rewrite to compare Tensor Processing Units against graphics processing units on price and throughput. (developers.googleblog.com)

That last part is the real shift. PyTorch is the default language of a huge share of machine learning research and production, and Google is trying to make Tensor Processing Units feel less like a special destination and more like just another device target, the way `cuda` became shorthand for Nvidia hardware in everyday model code. (developers.googleblog.com, github.com)
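
For readers who want to see the lazy-versus-eager distinction in miniature, here is a toy sketch in plain Python. It is not PyTorch, PyTorch XLA, or TorchTPU code, and the class names `EagerValue` and `LazyValue` and the `sync()` method are made up for illustration. The eager class computes each result the moment an operation is written. The lazy class only records the operation and waits for an explicit sync point, which is roughly the role the article attributes to calls like `xm.mark_step`.

```python
# Toy illustration only: hypothetical classes, not any real PyTorch or XLA API.

class EagerValue:
    """Computes every operation immediately, like ordinary eager PyTorch."""

    def __init__(self, data):
        self.data = data

    def __add__(self, other):
        return EagerValue(self.data + other.data)

    def __mul__(self, other):
        return EagerValue(self.data * other.data)


class LazyValue:
    """Records operations as a graph; nothing is computed until sync()."""

    def __init__(self, data=None, op=None, inputs=()):
        self.data = data      # stays None for recorded (not yet run) operations
        self.op = op          # function to apply to the inputs at sync time
        self.inputs = inputs  # upstream LazyValue nodes

    def __add__(self, other):
        return LazyValue(op=lambda a, b: a + b, inputs=(self, other))

    def __mul__(self, other):
        return LazyValue(op=lambda a, b: a * b, inputs=(self, other))

    def sync(self):
        """Explicit sync point: walk the recorded graph and compute results."""
        if self.data is None:
            args = [node.sync() for node in self.inputs]
            self.data = self.op(*args)
        return self.data


# Eager: the result exists as soon as the line runs.
eager_result = EagerValue(2.0) * EagerValue(3.0) + EagerValue(2.0)

# Lazy: the same expression only builds a graph until sync() is called.
lazy_result = LazyValue(2.0) * LazyValue(3.0) + LazyValue(2.0)
pending = lazy_result.data          # still None at this point
computed = lazy_result.sync()       # the graph runs here
```

In the lazy version a mistake inside the expression would only surface at `sync()`, far from the line that caused it, which is one way to picture why the PyTorch XLA team described the experience as distinct from native PyTorch.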
