Inference chip pivot

- The AI hardware race is shifting focus from training scale to inference cost and custom silicon.
- Google is reportedly in talks with Marvell to build inference chips while rivals and Nvidia push dedicated inference designs.
- That change makes compute procurement and deployment architecture strategic constraints for research teams (thenextweb.com, spacedaily.com, digitaltoday.co.kr, prnewswire.com).

AI chips are being redesigned for the part users actually touch: answering prompts, generating tokens, and doing it cheaply millions of times a day. (cloud.google.com, money.usnews.com)

Alphabet’s Google is in talks with Marvell Technology on two new chips for that job, according to a Reuters report on April 19 citing The Information. One would be a memory processing unit that works alongside Google’s Tensor Processing Unit, and the other would be a new TPU built specifically to run AI models. (money.usnews.com)

The report said Google and Marvell had not signed a contract, and that neither company immediately responded to requests for comment. It also said the companies aim to finalize the memory chip’s design as soon as 2027 before moving to test production. (money.usnews.com)

Inference is the execution phase of artificial intelligence: a trained model takes new input and produces an answer. A single response is lighter than training, but serving millions of responses in real time turns latency, power use, and memory bandwidth into the main engineering problem. (cloud.google.com, nvidia.com)

Google has already been moving its TPU line in that direction. Google Cloud says its seventh-generation Ironwood TPU is “custom built for high-volume low-latency AI inference and model serving,” while its earlier Trillium chips were pitched for both training and serving with lower latency and lower cost than TPU v5e. (blog.google.com, cloud.google.com)

Nvidia is making the same sales pitch from the other side of the market. Its Blackwell inference page says the GB300 NVL72 system delivers 35 times lower cost per token than Hopper and 50 times more tokens per watt, putting token economics at the center of the hardware battle. (nvidia.com)

Google’s chip talks also land two weeks after Broadcom disclosed a long-term agreement to develop and supply future generations of Google’s custom artificial intelligence chips and next-generation AI rack components through 2031. That means Google is exploring a new partner while keeping its existing TPU pipeline in place. (money.usnews.com)

Marvell has been courting that market openly. On its investor-event page, the company says its platform is aimed at “the next generation of custom AI infrastructure” and highlights the “growing opportunity for custom silicon.” (marvell.com)

For research teams and cloud buyers, that shifts the bottleneck from raw training scale to procurement and deployment design. The practical question is no longer only who can train the biggest model, but who can afford to serve it at useful speed, inside power and budget limits, once real users arrive. (cloud.google.com, nvidia.com, blog.google.com)
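
For teams weighing that question, the serving arithmetic behind "cost per token" and "tokens per watt" can be sketched in a few lines of Python. This is a rough back-of-envelope sketch, not any vendor's methodology. The `ServingProfile` class, the helper functions, and every number below are hypothetical placeholders chosen for illustration, not measurements of any TPU, GPU, or Marvell part.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServingProfile:
    """Hypothetical description of one accelerator serving one model."""
    name: str
    hourly_cost_usd: float    # assumed all-in cost of the accelerator per hour
    tokens_per_second: float  # assumed sustained output throughput
    power_watts: float        # assumed average power draw while serving


def cost_per_million_tokens(p: ServingProfile) -> float:
    """Dollars spent to generate one million tokens at sustained throughput."""
    tokens_per_hour = p.tokens_per_second * 3600
    return p.hourly_cost_usd / tokens_per_hour * 1_000_000


def tokens_per_joule(p: ServingProfile) -> float:
    """Energy efficiency: one watt is one joule per second, so
    tokens per second divided by watts gives tokens per joule."""
    return p.tokens_per_second / p.power_watts


def fits_budget(p: ServingProfile, daily_tokens: float, daily_budget_usd: float) -> bool:
    """Check whether serving `daily_tokens` stays within the daily budget."""
    daily_cost = cost_per_million_tokens(p) * daily_tokens / 1_000_000
    return daily_cost <= daily_budget_usd


# Illustrative placeholder numbers only; substitute measured values.
baseline = ServingProfile("baseline", hourly_cost_usd=4.0,
                          tokens_per_second=2_000.0, power_watts=700.0)
candidate = ServingProfile("candidate", hourly_cost_usd=6.0,
                           tokens_per_second=12_000.0, power_watts=1_000.0)

if __name__ == "__main__":
    for profile in (baseline, candidate):
        print(profile.name,
              f"${cost_per_million_tokens(profile):.3f} per 1M tokens,",
              f"{tokens_per_joule(profile):.2f} tokens per joule,",
              "fits budget:", fits_budget(profile, 1e9, 500.0))
```

The point of the sketch is that a pricier chip can still win on cost per token if its throughput rises faster than its hourly price, which is the comparison the vendors' inference pitches turn on.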
