Leaked TPU v9 notes

Leaked notes on Google's TPU v9 describe 3D stacking aimed at hyperscale inference and name Broadcom and Intel as ASIC partners. They also flag SRAM buffering as the key bottleneck to watch for large-batch AI workloads. (x.com)

A Tensor Processing Unit is Google’s custom artificial intelligence chip, built to do the matrix math behind models faster than a general-purpose processor. Google’s own documentation says these chips are application-specific integrated circuits tuned for machine learning workloads. (cloud.google.com)

Google’s current public flagship, Ironwood, shows where the roadmap has already moved: inference first. Google introduced the seventh-generation chip on April 9, 2025 as its first Tensor Processing Unit designed specifically for inference, not just training. (blog.google) Inference is the stage where a trained model answers prompts, ranks results, or generates tokens for users. Google said Ironwood scales to 9,216 chips and delivers 42.5 exaflops in a full pod, aimed at what it called “thinking” models and large-scale serving. (blog.google) That matters for reading any leaked Tensor Processing Unit v9 notes, because the public roadmap already points toward bigger systems built for serving models at cloud scale.

Google’s TPU7x documentation says Ironwood has 192 gibibytes of high-bandwidth memory per chip and is built for dense models, mixture-of-experts models, pre-training, sampling, and decode-heavy inference. (cloud.google.com) The same documentation also spells out the memory problem behind the leak. Google says high-bandwidth memory can still bottleneck memory-bound operations, while the faster on-chip static random-access memory buffer, called vector memory, is small enough that buffer size has to be tuned carefully. (cloud.google.com) Put simply, the chip can multiply numbers very quickly, but it still loses time if data cannot be staged close enough to the compute blocks. That is why leaked references to static random-access memory buffering on large-batch workloads fit the bottlenecks Google has already described in public for its current inference hardware. (cloud.google.com)

The packaging piece also lines up with broader industry moves. TrendForce reported on November 25, 2025 that Google planned to implement Intel’s Embedded Multi-die Interconnect Bridge in its 2027 Tensor Processing Unit v9, as cloud providers looked for larger packages and alternatives to Taiwan Semiconductor Manufacturing Company’s Chip-on-Wafer-on-Substrate capacity constraints. (trendforce.com) TrendForce said Embedded Multi-die Interconnect Bridge can support larger effective package scaling and lower cost by removing the large interposer used in Chip-on-Wafer-on-Substrate. The same report also said the tradeoff is lower bandwidth and slightly higher latency, which helps explain why packaging choices are central to any inference-first chip design. (trendforce.com)

Broadcom’s role is less speculative. Broadcom disclosed in April 2026 that it had a long-term agreement with Google to develop and supply future generations of Google’s custom artificial intelligence chips through 2031, extending a partnership that has underpinned multiple Tensor Processing Unit generations. (thegputrade.com) Intel, for its part, announced on April 9, 2026 that it had expanded its collaboration with Google on custom application-specific integrated circuit-based infrastructure processing units, alongside continued Xeon central processing unit deployments across Google Cloud. Intel’s statement did not mention Tensor Processing Unit v9 directly, but it confirmed a broader multiyear Google-Intel infrastructure relationship. (intel.com)

So the leaked Tensor Processing Unit v9 notes land in a hardware race that is already shifting from raw training power to serving models cheaply, quickly, and in huge volumes. If the leak is accurate, the real question is not whether Google wants more inference capacity, but whether packaging and memory buffering can keep up with the scale it is already publicly chasing. (blog.google)
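
For readers who want to sanity-check the scale involved, the Ironwood pod figures quoted above (9,216 chips, 42.5 exaflops, 192 gibibytes of high-bandwidth memory per chip) can be turned into rough per-chip and per-pod numbers. The Python sketch below is a back-of-envelope calculation only: the function names are illustrative, and it assumes the pod total divides evenly across chips and that the exaflops figure is a peak number at whatever precision Google quoted. Nothing here comes from the leaked v9 notes.

```python
# Back-of-envelope arithmetic on the Ironwood pod figures quoted in the article.
# Assumption: pod-level peak compute divides evenly across all chips.

CHIPS_PER_POD = 9_216        # chips in a full Ironwood pod
POD_EXAFLOPS = 42.5          # quoted peak compute for a full pod
HBM_GIB_PER_CHIP = 192       # quoted high-bandwidth memory per chip


def per_chip_petaflops(pod_exaflops: float, chips: int) -> float:
    """Implied peak compute per chip, in petaflops (1 exaflop = 1,000 petaflops)."""
    return pod_exaflops * 1_000 / chips


def pod_hbm_pib(hbm_gib_per_chip: int, chips: int) -> float:
    """Total high-bandwidth memory across a pod, in pebibytes (1 PiB = 1024**2 GiB)."""
    return hbm_gib_per_chip * chips / 1024**2


if __name__ == "__main__":
    print(f"Per-chip peak: {per_chip_petaflops(POD_EXAFLOPS, CHIPS_PER_POD):.2f} petaflops")
    print(f"Pod HBM total: {pod_hbm_pib(HBM_GIB_PER_CHIP, CHIPS_PER_POD):.4f} PiB")
```

Under those assumptions the figures work out to roughly 4.6 petaflops per chip and about 1.69 pebibytes of high-bandwidth memory per pod, which illustrates why staging data close to the compute blocks matters at this scale.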
