Inference‑first ASICs emerge
- Google is reportedly in talks with Marvell on two inference-focused chips, including a memory processing unit meant to ease TPU memory bottlenecks.
- Commentary contrasts these chips with earlier TPU scale figures, citing Ironwood at 4,614 TFLOPs (FP8) and 192GB of HBM per chip.
- The designs emphasize memory-aware accelerators to improve inference efficiency and real-world deployment performance. (x.com)
A new class of artificial-intelligence chips is being aimed at inference, the step where trained models answer real requests, rather than at training those models from scratch. (theinformation.com) Google is in talks with Marvell Technology on two inference chips, including a memory processing unit that would work alongside Google’s Tensor Processing Unit, according to The Information on April 19, 2026. (theinformation.com)

Inference usually runs into a memory problem before it runs into a math problem: the chip has to fetch model weights from high-bandwidth memory fast enough to keep its compute units busy. Google’s own TPU7x documentation says Ironwood uses 192 gigabytes of high-bandwidth memory per chip and a memory hierarchy built for large-scale inference. (docs.cloud.google.com)

Google introduced Ironwood on April 9, 2025 as its seventh-generation Tensor Processing Unit and called it its first TPU designed specifically for the “age of inference.” The company said one Ironwood chip delivers 4,614 teraFLOPs at FP8 precision. (blog.google)

That leaves a second bottleneck: moving data between memory and compute without wasting power or time. Marvell has been pitching custom high-bandwidth-memory designs since December 2024 that it says can free up 25% more area for compute, raise memory capacity by 33%, and cut memory-interface power by 70%. (marvell.com)

The reported Google-Marvell design points to a split-chip approach: keep the Tensor Processing Unit doing the matrix math, and add a separate chip focused on feeding it data. That is a different emphasis from the biggest training systems of the last two years, which were sold mostly on raw teraFLOPs, larger pods, and more memory stacks. (theinformation.com)

Google’s recent TPU messaging already moved in that direction.
Its April 2025 Ironwood announcement said the chip was built for “high-throughput, low-latency inference,” and a later Google Cloud engineering post said the hardware and software stack was co-designed for serving models such as Gemini at scale. (blog.google, cloud.google.com)

The business stakes are large because inference is the part customers hit every time they type a prompt, generate an image, or call an application programming interface. CNBC reported on April 20, 2026 that Marvell shares rose after reports it was helping Google on two new AI chips, while Broadcom shares fell nearly 2%. (cnbc.com)

Google has not publicly announced the reported Marvell chips, and CNBC said Google declined to comment on the report. But the direction is clear in the public record: newer AI chips are being sold less as giant calculators and more as systems built to keep data moving. (cnbc.com, docs.cloud.google.com)
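
The memory-before-math point can be made concrete with a rough roofline estimate. The sketch below is illustrative only: the 4,614 teraFLOPs figure comes from Google's Ironwood announcement, while the memory bandwidth, the model size, and the two-operations-per-weight-byte estimate are assumptions chosen for the example, not figures from the reporting.

```python
# Rough roofline estimate: is a workload limited by memory or by math?
# Only PEAK_FLOPS comes from the article. Everything marked ASSUMED is a
# hypothetical value chosen for illustration.

PEAK_FLOPS = 4_614e12         # FP8 FLOPs per second, per Google's Ironwood announcement
ASSUMED_HBM_BANDWIDTH = 7e12  # bytes per second (ASSUMED, not from the article)
ASSUMED_WEIGHT_BYTES = 100e9  # 100 GB of FP8 weights (ASSUMED hypothetical model)


def ridge_point(peak_flops: float, bandwidth: float) -> float:
    """FLOPs that must be done per byte fetched to keep compute units busy."""
    return peak_flops / bandwidth


def is_memory_bound(flops_per_byte: float, peak_flops: float, bandwidth: float) -> bool:
    """True when the workload does too little math per byte to saturate compute."""
    return flops_per_byte < ridge_point(peak_flops, bandwidth)


def decode_intensity(batch_size: int) -> float:
    """ASSUMED estimate: with 1-byte (FP8) weights, each weight byte fetched
    is used for one multiply and one add per request in the batch."""
    return 2.0 * batch_size


def weight_read_time(weight_bytes: float, bandwidth: float) -> float:
    """Seconds to stream all weights from memory once. For a dense model that
    rereads its weights for every generated token (ASSUMED), this is a lower
    bound on time per token regardless of peak FLOPs."""
    return weight_bytes / bandwidth


if __name__ == "__main__":
    ridge = ridge_point(PEAK_FLOPS, ASSUMED_HBM_BANDWIDTH)
    print(f"ridge point: {ridge:.0f} FLOPs per byte")
    for batch in (1, 64, 512):
        bound = is_memory_bound(decode_intensity(batch), PEAK_FLOPS, ASSUMED_HBM_BANDWIDTH)
        print(f"batch {batch}: {'memory-bound' if bound else 'compute-bound'}")
    floor_ms = weight_read_time(ASSUMED_WEIGHT_BYTES, ASSUMED_HBM_BANDWIDTH) * 1e3
    print(f"weight-read floor per token: {floor_ms:.1f} ms")
```

Under these assumed numbers, the chip would need roughly 659 operations per byte fetched to be limited by compute, while a single request supplies about 2. That gap is why a design that feeds data faster can matter more for inference than one that adds teraFLOPs.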