AWS taps Cerebras for Bedrock speed
AWS announced a collaboration with Cerebras to bring high‑speed, disaggregated inference to Amazon Bedrock, aiming to lower latency and boost throughput for generative AI workloads running in AWS data centers. The move signals more vendor diversity in production model serving and creates opportunities to abstract hardware choices in inference stacks.
AWS’s March 13, 2026 press release states that the deployed stack will combine AWS Trainium‑powered servers, Cerebras CS‑3 systems (WSE family), and Elastic Fabric Adapter (EFA) networking inside AWS data centers. The partners define “inference disaggregation” as a split in which Trainium handles the parallel, compute‑bound prefill stage and the Cerebras CS‑3/WSE handles the serial, memory‑bandwidth‑heavy decode stage, with the KV cache transferred between them over EFA. (press.aboutamazon.com)
Cerebras claims the disaggregated configuration delivers 5× more high‑speed token capacity in the same hardware footprint and that Cerebras‑powered models can reach up to about 3,000 tokens/sec, while AWS characterizes the overall system as delivering “an order of magnitude faster” inference. Amazon and Cerebras say the service will be available through Amazon Bedrock in the coming months and that leading open‑source LLMs plus Amazon’s Nova family are planned to run on Cerebras hardware “later this year.”
Multiple outlets report that the deployment will run on AWS’s Nitro platform to preserve the established security and isolation guarantees for Trainium instances and CS‑3 systems in AWS data centers. Cerebras notes that productionizing this split requires moving the KV cache from prefill nodes to decode nodes over a low‑latency, high‑bandwidth fabric (EFA), which implies that Bedrock and its orchestration layers must handle KV routing, model placement, and affinity policies in the control plane.
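To make the prefill/decode split and the control‑plane duties described above more concrete, here is a minimal Python sketch. It is not the Bedrock, Trainium, or Cerebras API; every name in it (KVCache, DecodeNode, Router, transfer_seconds) and every number is a hypothetical stand‑in chosen for illustration. It shows one plausible shape for the problem: a prefill step produces a KV cache, a router places that cache on a decode node with enough free memory, and later calls for the same request are pinned to that node (affinity).

```python
from dataclasses import dataclass, field


@dataclass
class KVCache:
    """Stand-in for the attention key/value state produced by prefill."""
    request_id: str
    num_tokens: int
    bytes_per_token: int  # assumption: depends on model shape and precision

    @property
    def size_bytes(self) -> int:
        return self.num_tokens * self.bytes_per_token


@dataclass
class DecodeNode:
    """Hypothetical decode worker (the role the article assigns to CS-3)."""
    name: str
    capacity_bytes: int
    resident: dict = field(default_factory=dict)  # request_id -> KVCache

    @property
    def free_bytes(self) -> int:
        used = sum(kv.size_bytes for kv in self.resident.values())
        return self.capacity_bytes - used


def prefill(request_id: str, prompt_tokens: list, bytes_per_token: int) -> KVCache:
    """Prefill stage: the whole prompt is processed at once (Trainium's role)."""
    return KVCache(request_id, len(prompt_tokens), bytes_per_token)


class Router:
    """Toy control plane: places KV caches and keeps request-to-node affinity."""

    def __init__(self, decode_nodes):
        self.decode_nodes = list(decode_nodes)
        self.affinity = {}  # request_id -> DecodeNode

    def place(self, kv: KVCache) -> DecodeNode:
        # Affinity: a request whose cache is already resident stays on that node.
        if kv.request_id in self.affinity:
            return self.affinity[kv.request_id]
        candidates = [n for n in self.decode_nodes if n.free_bytes >= kv.size_bytes]
        if not candidates:
            raise RuntimeError("no decode node has room for this KV cache")
        # Placement policy (an assumption): pick the node with the most free memory.
        node = max(candidates, key=lambda n: n.free_bytes)
        node.resident[kv.request_id] = kv  # models the transfer over the fabric
        self.affinity[kv.request_id] = node
        return node

    def release(self, request_id: str) -> None:
        """Free the cache once decoding for the request has finished."""
        node = self.affinity.pop(request_id, None)
        if node is not None:
            node.resident.pop(request_id, None)


def transfer_seconds(kv: KVCache, link_gbps: float) -> float:
    """Idealized transfer time: size / bandwidth, ignoring latency and overhead."""
    return kv.size_bytes * 8 / (link_gbps * 1e9)
```

A short usage example, again with made‑up sizes: `kv = prefill("r1", list(range(100)), bytes_per_token=1000)` followed by `Router([DecodeNode("d0", 1_000_000), DecodeNode("d1", 2_000_000)]).place(kv)` places the cache on the emptier node. The transfer_seconds helper illustrates why the fabric matters: the cache grows with prompt length, so the time to ship it from prefill to decode scales with prompt size divided by link bandwidth. A real scheduler would also have to weigh queue depth, model placement, and failure handling, none of which are modeled here.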