Amazon and Cerebras push disaggregated inference
Amazon has partnered with Cerebras on a push toward disaggregated inference, an explicit bid to challenge NVIDIA’s monolithic GPU dominance and enable heterogeneous, composable serving across CPUs, custom ASICs and tensor processors, according to reports. The takeaway: model serving will shift toward portable, hardware‑agnostic orchestration to optimize cost, latency and energy for production LLM workloads.
AWS announced a Trainium + Cerebras CS‑3 deployment that will be accessible through Amazon Bedrock from AWS data centers. The deployment pairs AWS Trainium‑powered servers with Cerebras CS‑3 systems, which are built on Cerebras’s WSE‑3 wafer‑scale engine, and links them over Amazon’s Elastic Fabric Adapter (EFA) networking to move traffic between the two platforms, Cerebras and AWS said.
AWS described an explicit disaggregation of inference into two phases, “prefill” (parallel, compute‑heavy) and “decode” (serial, memory‑bandwidth‑heavy), and stated that decode typically dominates token latency. AWS and Cerebras said the design lets Trainium handle prefill while the CS‑3 handles decode, producing what AWS VP David Brown described as “an order of magnitude” faster inference for token generation.
The companies stated that the service will launch via Amazon Bedrock “in the next couple of months,” with leading open‑source LLMs and Amazon Nova on Cerebras hardware planned “later this year.” AWS confirmed that the deployment will run on the AWS Nitro platform to provide the same isolation and operational controls as other cloud services, so that Bedrock‑hosted inference fits existing enterprise operational models. Both firms framed the deal as a multiyear collaboration and positioned AWS as the first cloud provider to host Cerebras’s disaggregated inference solution, according to Cerebras’s blog and AWS press materials posted March 13, 2026.
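To make the prefill/decode split concrete, here is a minimal back‑of‑the‑envelope sketch in Python. It is not AWS's or Cerebras's implementation. The `Backend` class, the `request_latency` function, the pool names and every throughput figure are hypothetical and chosen only for illustration; none of them are measurements. The `handoff_s` parameter is an assumed cost for moving the intermediate state that prefill produces (in transformer serving, typically the key/value cache) between the two platforms.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    """A hypothetical accelerator pool. Rates are illustrative, not measured."""
    name: str
    prefill_tokens_per_s: float  # prompt tokens processed per second (parallel phase)
    decode_tokens_per_s: float   # output tokens generated per second (serial phase)


def request_latency(prompt_tokens, output_tokens, prefill_on, decode_on, handoff_s=0.0):
    """Estimate latency for one request when each phase runs on its own backend.

    handoff_s is an assumed fixed cost for transferring prefill state
    from the prefill backend to the decode backend.
    """
    prefill_s = prompt_tokens / prefill_on.prefill_tokens_per_s
    decode_s = output_tokens / decode_on.decode_tokens_per_s
    total_s = prefill_s + handoff_s + decode_s
    return {
        "prefill_s": prefill_s,
        "handoff_s": handoff_s,
        "decode_s": decode_s,
        "total_s": total_s,
        "decode_share": decode_s / total_s,
    }


# Made-up pools: pool_b decodes 10x faster than pool_a, echoing the
# "order of magnitude" framing, but the absolute numbers are invented.
pool_a = Backend("prefill-optimized", prefill_tokens_per_s=10_000, decode_tokens_per_s=50)
pool_b = Backend("decode-optimized", prefill_tokens_per_s=10_000, decode_tokens_per_s=500)

# Both phases on one pool versus prefill on pool_a and decode on pool_b.
colocated = request_latency(2_000, 500, prefill_on=pool_a, decode_on=pool_a)
split = request_latency(2_000, 500, prefill_on=pool_a, decode_on=pool_b, handoff_s=0.05)

print(f"colocated: {colocated['total_s']:.2f}s, decode share {colocated['decode_share']:.0%}")
print(f"split:     {split['total_s']:.2f}s, decode share {split['decode_share']:.0%}")
```

Under these invented numbers, decode accounts for nearly all of the colocated latency, which is why moving only that phase to faster hardware changes the total so much. The model also shows the limit of the approach: once decode is fast, prefill time and the handoff cost become a visible share of what remains.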