AWS pairs Cerebras + Trainium for split inference

AWS announced a collaboration to pair Cerebras silicon with its Trainium chips on Bedrock, splitting inference so that Trainium handles prefill and Cerebras handles decode, and claiming up to 5× faster inference in some configurations. It’s an unusual hyperscaler pairing that fragments the inference stack across vendors.

AWS said the arrangement is a multiyear collaboration that will place Cerebras wafer‑scale systems inside AWS data centers for a hosted inference offering. (businesswire.com) The companies confirmed that the service will use AWS’s Elastic Fabric Adapter for low‑latency RDMA networking and will run on AWS Nitro infrastructure, giving it the same security and isolation guarantees as other AWS services. (businesswire.com)

The WSE‑3 chip that underpins Cerebras’s CS‑3 system contains roughly 4 trillion transistors across about 900,000 AI cores and carries ~44 GB of on‑chip SRAM, with very high memory bandwidth according to the vendor and press coverage. (cerebras.ai) A single CS‑3 is marketed at ~125 peak petaflops of AI performance, and Cerebras describes clusters scaling to thousands of CS‑3s for exaflop‑class aggregate throughput. (businesswire.com) Cerebras has previously published production inference runs at roughly 3,000 tokens per second on large open models, and both companies said AWS will add leading open‑source LLMs and Amazon’s Nova models on the new infrastructure later this year. (cerebras.ai)

Reporters and analysts called the deal a strategic move for hyperscalers to diversify hardware options and noted that financial terms were not disclosed. The announcement said the service will roll out “in the coming months,” while some outlets expect commercial availability in the second half of 2026. (money.usnews.com)
