AI compute squeeze

- Google unveiled an eighth-generation TPU design that splits chips for training and inference to cut serving costs at scale.
- Thinking Machines Lab signed a multibillion-dollar Google Cloud deal that will use Nvidia GB300-class infrastructure for heavy workloads.
- Cheaper inference aims to make always-on agent features commercially viable, reshaping which AI behaviours analytics teams can instrument (techcrunch.com) (benzinga.com).

Artificial intelligence companies are redrawing the line between building a model and running it, as Google rolls out separate chips for each job and signs up a major customer to buy more cloud capacity. (blog.google) (techcrunch.com)

Training is the expensive step where a model learns from huge data sets; inference is the cheaper-looking but constant step where the model answers every prompt after launch. Google said on April 22 that its eighth-generation Tensor Processing Units split those tasks into TPU 8t for training and TPU 8i for inference. (blog.google) (cnbc.com)

Google said TPU 8i is built for “high-speed inference,” while TPU 8t is aimed at “massive model training,” a design the company tied to the heavier back-and-forth workloads of AI agents. CNBC reported the new inference chip carries 384 megabytes of static random access memory, or SRAM, triple the amount in Ironwood. (blog.google) (cnbc.com)

Google is still buying heavily from Nvidia while pushing its own silicon. TechCrunch reported on April 22 that Thinking Machines Lab, the startup founded by former OpenAI chief technology officer Mira Murati, signed a multibillion-dollar Google Cloud deal that includes access to Nvidia GB300 systems. (techcrunch.com) (siliconangle.com)

SiliconANGLE reported the startup will use Google Cloud A4X Max instances, with each virtual machine exposing four Nvidia Blackwell Ultra graphics processing units and two 72-core central processing units. That arrangement shows how cloud vendors are mixing custom chips and Nvidia hardware instead of forcing customers onto one stack. (siliconangle.com)

The cost pressure is shifting from one-time training runs to nonstop serving. Google said the new chips are built for the “agentic era,” and three weeks earlier it introduced Flex and Priority tiers in the Gemini application programming interface to let developers trade off price, latency, and reliability for inference jobs. (blog.google 1) (blog.google 2)

That matters for products that keep a model running in the background, retry steps, call tools, and wait for user input. Each extra turn adds inference demand, so shaving serving costs can decide whether an always-on assistant is a premium feature or a default one. (blog.google) (benzinga.com)

Google has been moving toward this split for months. Its seventh-generation Ironwood chips were already positioned for large-scale training and inference, but the company is now breaking those roles into separate products instead of asking one design to cover both. (docs.cloud.google.com) (blog.google)

Bloomberg reported on April 20 that outside developers, including some of Google’s rivals, have been lining up for Google’s AI chips as demand for accelerator capacity stays tight. The new TPU rollout and the Thinking Machines contract both point to the same constraint: compute is still scarce enough that buyers are reserving multiple paths to get it. (bloomberg.com) (techcrunch.com)

The immediate contest is not only whose chip is fastest, but whose cloud can supply enough training capacity and cheap enough inference to keep AI services running all day. Google’s answer this week was to sell both. (blog.google) (techcrunch.com)
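To make the point about per-turn inference demand concrete, here is a rough back-of-the-envelope sketch in Python. Every number in it is a made-up assumption chosen for illustration (turn counts, tokens per turn, session volume, and the price per million tokens); none comes from Google, Nvidia, or the reports cited above. It only shows how the daily serving bill scales with the number of model calls per session and with the unit price.

```python
from dataclasses import dataclass


@dataclass
class AgentSession:
    """Hypothetical workload profile; every field is an illustrative assumption."""
    turns: int             # model calls per session (replies, retries, tool calls)
    tokens_per_turn: int   # tokens processed on each call
    sessions_per_day: int  # how many sessions the product serves daily


def daily_inference_cost(session: AgentSession, price_per_million_tokens: float) -> float:
    """Daily serving cost in dollars: total tokens times a flat per-token price."""
    tokens = session.turns * session.tokens_per_turn * session.sessions_per_day
    return tokens / 1_000_000 * price_per_million_tokens


# Two invented profiles: a short chat exchange and a longer agent-style loop.
chat_style = AgentSession(turns=4, tokens_per_turn=2_000, sessions_per_day=10_000)
agent_style = AgentSession(turns=20, tokens_per_turn=2_000, sessions_per_day=10_000)

for label, profile in [("chat-style", chat_style), ("agent-style", agent_style)]:
    for price in (1.00, 0.50):  # hypothetical dollars per million tokens
        cost = daily_inference_cost(profile, price)
        print(f"{label:12s} at ${price:.2f}/M tokens: ${cost:,.2f} per day")
```

Under these invented figures, the agent-style profile costs five times as much per day as the chat-style one at the same price, and halving the unit price halves either bill. Real pricing is rarely a single flat rate, so treat this as an illustration of the scaling, not an estimate.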
