Google targets inference chips

- Google is developing inference-focused AI chips with Marvell and broadening TPU access beyond Google Cloud.
- Reports say Google plans on-prem deployment options and wider PyTorch support to reduce adoption friction.
- The effort reflects a market shift from headline model training to efficient, deployable inference hardware and software integration. (qz.com)

Google is developing new AI chips with Marvell that are aimed at inference, the step where a trained model answers a prompt in real time. (reuters.com)

Reuters reported on April 19 that Google is discussing two chips with Marvell Technology, citing The Information: a memory processing unit to work alongside Google’s Tensor Processing Units, and a new TPU tuned for inference. Reuters said the talks would add Marvell to a supplier lineup that already includes Broadcom and MediaTek. (reuters.com)

Inference is the expensive part of artificial intelligence once a model is live: every chatbot reply, search summary, or coding suggestion has to run on chips in a data center. Google said when it introduced Ironwood at Cloud Next in April 2025 that it was the company’s first TPU designed specifically for inference. (blog.google)

Google has spent years using TPUs inside its own services and renting them through Google Cloud, but it has also been trying to make them easier for outside developers to adopt. Google’s Cloud TPU documentation says Trillium, its sixth-generation TPU, is available on Google Cloud as TPU v6e, and Google’s April 7 developer post introduced TorchTPU as a way to run PyTorch natively on TPU infrastructure. (cloud.google.com) (docs.cloud.google.com) (developers.googleblog.com)

That software push matters because Nvidia’s grip on artificial intelligence is not just about chips; it is also about the tools developers already use. Google said TorchTPU follows an “eager first” design for PyTorch workloads, while PyTorch’s own XLA documentation describes Google Cloud TPUs as accelerators for both training and inference. (developers.googleblog.com) (docs.pytorch.org)

The hardware side is moving the same way. Google said Ironwood is “custom built for high-volume low-latency AI inference and model serving,” and its Cloud documentation says TPU7x is the first release in the Ironwood family. (blog.google) (docs.cloud.google.com)

Google is also telling customers that inference cost and portability are now central buying questions. A Google Cloud Next 2026 session description promised discussion of “TPU inference” with vLLM and other frameworks, including the tradeoffs between latency, throughput, and moving workloads between graphics processors and TPUs. (googlecloudevents.com)

Marvell’s role fits its recent strategy in custom silicon. CNBC reported on April 20 that Marvell shares rose after the report, while Broadcom shares fell, even as Google’s existing Broadcom relationship remained in place. (cnbc.com)

Google has not publicly confirmed a Marvell deal, and Reuters said the discussions were still talks, not a signed agreement. But the direction is clear: Google is building more of its AI stack around serving models cheaply and quickly, not just training bigger ones. (reuters.com)
