AWS + Cerebras for inference

AWS and Cerebras announced a disaggregated architecture that combines AWS Trainium processors with Cerebras wafer‑scale systems on Amazon Bedrock to deliver faster cloud inference (hpcwire.com); the announcement was also discussed on social channels (x.com). The partnership is being positioned as a high‑throughput inference alternative to traditional GPU stacks.

AWS and Cerebras signed a multiyear collaboration that AWS framed as making it the first cloud provider to offer Cerebras’s disaggregated inference solution exclusively on Amazon Bedrock. The joint design explicitly splits inference into two stages, “prefill” on AWS Trainium and “decode” on Cerebras CS‑3/WSE‑3, connected by AWS Elastic Fabric Adapter (EFA) networking. Cerebras’s own blog says the disaggregated stack delivers roughly 5× more high‑speed token capacity in the same hardware footprint, the figure the company uses to quantify the decode throughput improvement.

The wafer‑scale WSE‑3 chip that powers the CS‑3 contains about 4 trillion transistors across about 900,000 AI cores, with about 44 GB of on‑chip SRAM and memory bandwidth in the tens of petabytes per second. Cerebras cites these specifications in support of its claim of 125 petaFLOPS per CS‑3.

AWS’s announcement states that the Trainium+CS‑3 service will appear on Amazon Bedrock “in the coming months,” with leading open‑source LLMs and Amazon Nova planned to run on Cerebras hardware “later this year.” Cerebras argued that the push targets high‑token workloads such as “agentic coding,” which the company estimates produces about 15× more tokens per query than conversational chat, and it used that figure to justify decode‑focused acceleration in the joint solution.
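To make the prefill/decode split concrete, here is a minimal toy sketch in Python. It is not the AWS or Cerebras implementation. Every name in it (KVCache, prefill, transfer, decode, generate) is hypothetical, the "model" is a placeholder arithmetic rule, and the assumption that the state handed between stages is a key/value cache reflects how disaggregated inference is commonly described, not anything confirmed in the announcement.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class KVCache:
    """Hypothetical stand-in for the attention state built from the prompt."""
    entries: List[int] = field(default_factory=list)


def prefill(prompt_tokens: List[int]) -> KVCache:
    # Stage 1: process the whole prompt in a single pass.
    # In the announced design this stage runs on AWS Trainium.
    return KVCache(entries=list(prompt_tokens))


def transfer(cache: KVCache) -> KVCache:
    # Hand-off between the two hardware pools. The source says the stages
    # are connected by EFA networking; here it is just a copy (assumption).
    return KVCache(entries=list(cache.entries))


def decode(cache: KVCache, max_new_tokens: int) -> List[int]:
    # Stage 2: generate one token at a time, each step reading the full
    # cache. In the announced design this stage runs on Cerebras CS-3.
    generated = []
    for _ in range(max_new_tokens):
        next_token = sum(cache.entries) % 1000  # placeholder for a real model
        cache.entries.append(next_token)
        generated.append(next_token)
    return generated


def generate(prompt_tokens: List[int], max_new_tokens: int) -> List[int]:
    cache = prefill(prompt_tokens)         # prompt processing
    cache = transfer(cache)                # move state to the decode pool
    return decode(cache, max_new_tokens)   # sequential token generation
```

The sketch only illustrates the shape of the pipeline: prefill touches the prompt once, while decode loops once per output token. That loop is why workloads that emit many tokens per query, such as the agentic coding case Cerebras cites, spend most of their time in the decode stage.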
