AWS + Cerebras for wafer-scale inference

AWS announced a collaboration with Cerebras to offer open-source LLMs and Amazon Nova models on Cerebras wafer-scale hardware later this year, promising large gains in inference speed and cost for cloud-hosted models. That could shift workload placement decisions for enterprises balancing cost, latency, and model capability.

AWS and Cerebras announced the collaboration on March 13, 2026, framing it as a cloud deployment that pairs Amazon Trainium with Cerebras inference appliances (press.aboutamazon.com). The design routes the prefill stage, which processes the prompt, to AWS Trainium and the decode stage, which generates output tokens, to Cerebras CS‑3 units powered by the wafer‑scale engine (WSE‑3), an explicit split the companies say targets decode throughput (cerebras.ai). Cerebras and trade-press coverage reported that the CS‑3/WSE‑3 can drive “several thousand tokens per second” on decode workloads, a figure pitched at interactive apps such as coding assistants and chat interfaces (siliconangle.com). AWS’s statement says the systems will be networked with Elastic Fabric Adapter (EFA) and surfaced via Amazon Bedrock, and several outlets described the agreement as a multiyear partnership, while Bloomberg and others warned that the Trainium→WSE handoff can create cross‑device communication overhead that needs to be measured in production (press.aboutamazon.com).
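
To see why the handoff overhead matters, a rough back-of-envelope model helps. The sketch below is illustrative only: the function names and every number in the usage example are hypothetical assumptions, not figures published by AWS or Cerebras. It treats request latency as prefill time plus a fixed handoff cost plus decode time, and reports what share of the total the handoff consumes.

```python
def end_to_end_latency_s(prompt_tokens, output_tokens,
                         prefill_tok_per_s, decode_tok_per_s, handoff_s):
    """Simplified latency model for a split prefill/decode pipeline.

    All rates and the handoff cost are caller-supplied assumptions;
    nothing here reflects measured Trainium or CS-3 performance.
    """
    prefill_s = prompt_tokens / prefill_tok_per_s   # prompt processing
    decode_s = output_tokens / decode_tok_per_s     # token generation
    return prefill_s + handoff_s + decode_s


def handoff_share(prompt_tokens, output_tokens,
                  prefill_tok_per_s, decode_tok_per_s, handoff_s):
    """Fraction of total latency spent on the cross-device handoff."""
    total = end_to_end_latency_s(prompt_tokens, output_tokens,
                                 prefill_tok_per_s, decode_tok_per_s,
                                 handoff_s)
    return handoff_s / total


if __name__ == "__main__":
    # Hypothetical inputs chosen only to exercise the model.
    args = dict(prompt_tokens=2000, output_tokens=500,
                prefill_tok_per_s=10_000.0, decode_tok_per_s=2_000.0,
                handoff_s=0.05)
    print(f"total latency: {end_to_end_latency_s(**args):.3f} s")
    print(f"handoff share: {handoff_share(**args):.1%}")
```

In this toy model, the faster decode becomes, the larger the fixed handoff cost looms as a share of total latency, which is one way to read the call for production measurement.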
