AWS + Cerebras speedup

Martin Szerment flags that AWS is integrating Cerebras CS‑3 systems into Bedrock, aiming for roughly 5× token throughput by splitting inference into separate prefill and decode stages, which would mean cheaper, faster LLM inference at scale. The note is part of a wider thread on inference misalignment and throughput engineering. (x.com)

AWS and Cerebras announced a collaboration on March 13, 2026 to deploy Cerebras CS‑3 systems inside AWS data centers and make them available through Amazon Bedrock. (press.aboutamazon.com) The joint design explicitly disaggregates LLM inference into two stages: a parallel "prefill" stage that runs on AWS Trainium servers, and a serial, memory‑bandwidth‑heavy "decode" stage that runs on Cerebras CS‑3 hardware. Amazon’s Elastic Fabric Adapter (EFA) moves the key/value cache produced during prefill between the two systems. (press.aboutamazon.com)

Cerebras says the CS‑3 is powered by its WSE‑3 wafer‑scale engine, with roughly 4 trillion transistors and about 900,000 AI cores. The engine stores the model in ~44 GB of on‑chip SRAM and delivers on‑chip memory bandwidth in the tens of petabytes per second. (cerebras.ai)

AWS says the Trainium+CS‑3 Bedrock offering will roll out "in the coming months," and that Bedrock endpoints running leading open‑source LLMs and Amazon’s Nova models on Cerebras hardware will arrive later in 2026. (press.aboutamazon.com) AWS framed the result as significantly faster inference, calling it "an order of magnitude faster," and positioned itself as the first cloud provider to offer Cerebras’s disaggregated inference configuration via Bedrock. (press.aboutamazon.com) AWS’s Trainium3 UltraServers (AWS’s 3nm Trainium generation) are already generally available and are cited as the Trainium platform that will handle the prefill workloads for Bedrock; AWS has said Bedrock is already serving production workloads on Trainium3. (aboutamazon.com)
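To make the prefill/decode split concrete, here is a minimal toy sketch in Python. It is not AWS or Cerebras code and does not reflect any Bedrock API: every name in it (`KVCache`, `prefill`, `transfer`, `decode`, `generate`) is hypothetical, and the "model" is placeholder arithmetic. It only illustrates the shape of the pipeline, assuming one worker builds the key/value cache from the whole prompt, the cache is copied to a second worker, and that worker then emits tokens one at a time.

```python
from dataclasses import dataclass, field


@dataclass
class KVCache:
    # One (key, value) pair per processed token. Real caches hold
    # per-layer tensors; integers stand in for them here.
    entries: list = field(default_factory=list)


def prefill(prompt_tokens):
    """Process the whole prompt in one pass (the parallel-friendly stage)."""
    # Each prompt token's entry is independent of the others, which is
    # why this stage can be computed in parallel.
    return KVCache(entries=[(t, t * 2) for t in prompt_tokens])


def transfer(cache):
    """Hypothetical stand-in for shipping the cache to another system."""
    return KVCache(entries=list(cache.entries))


def decode(cache, max_new_tokens):
    """Emit tokens one at a time (the serial stage)."""
    generated = []
    for _ in range(max_new_tokens):
        # Every step reads the entire cache, so memory bandwidth
        # dominates; the arithmetic is a placeholder for a real model.
        next_token = sum(key for key, _ in cache.entries) % 100
        generated.append(next_token)
        cache.entries.append((next_token, next_token * 2))
    return generated


def generate(prompt_tokens, max_new_tokens):
    prefill_cache = prefill(prompt_tokens)   # stage 1: prefill worker
    decode_cache = transfer(prefill_cache)   # hand-off between systems
    return decode(decode_cache, max_new_tokens)  # stage 2: decode worker
```

Each decode step depends on the token produced by the previous one, so that loop cannot be parallelized across output tokens the way prefill can across prompt tokens. That asymmetry is the motivation the announcement gives for placing the two stages on different hardware.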

