Cerebras' SRAM Bandwidth Claim
Cerebras’ CEO is touting the company’s SRAM‑based wafer‑scale chips as delivering ~2,600x more memory bandwidth than NVIDIA Blackwell and claiming up to 15x faster token generation for inference. The company is also publicizing partnerships with Perplexity and CoreWeave to promote wafer‑scale inference as an alternative to GPUs. (x.com) (communicationstoday.co.in)
Cerebras’ WSE‑3 wafer‑scale chip is specified at 4 trillion transistors, about 900,000 AI‑optimized cores, roughly 44 GB of on‑chip SRAM and a peak of ~125 petaFLOPS of FP16 compute. (businesswire.com) Cerebras publishes an on‑chip SRAM bandwidth figure of roughly 21 petabytes/sec for the WSE‑3, and external analysts have contrasted that with HBM3e memory bandwidth figures for Blackwell GPUs (around 64 TB/sec for an 8‑GPU DGX B200 configuration) when explaining the large multiples being discussed. (cerebras.ai)
Independent benchmarking cited by Cerebras shows the company’s systems exceeding ~2,500 tokens/sec on Meta’s 400B Llama 4 Maverick model, while an 8‑GPU DGX B200 Blackwell configuration registered roughly 1,000 tokens/sec in the same comparison. (cerebras.ai) Perplexity’s “Sonar” search launch credits Cerebras infrastructure and cites Sonar running at about 1,200 tokens/sec on a Llama‑based 70B model; according to the vendor announcement, that capability is being rolled into Perplexity Pro access. (businesswire.com)
CoreWeave, Cerebras and BCE unveiled plans for a purpose‑built AI campus in Sherwood, Saskatchewan, described as roughly a 300 MW facility with a C$1.7 billion investment and a first phase expected online in the first half of next year. (bloomberg.com) Cerebras has previously offered pay‑as‑you‑go inference pricing (cited examples: $0.10 per million tokens for an 8B model and $0.60 per million tokens for a 70B model), and the company closed a $1.1 billion Series G round at an $8.1 billion valuation in September 2025 to expand manufacturing and data‑center capacity. (pcmag.com)
That performance conversation sits alongside contrasting metrics from NVIDIA and partners. NVIDIA has published aggregate rack‑scale throughput claims (up to ~1.5 million tokens/sec on a GB200 NVL72 rack for gpt‑oss‑120B), while engineering posts from Baseten show single‑Blackwell optimizations in the ~650 tokens/sec range for the same gpt‑oss‑120B model. The gap underscores the differences between per‑device, per‑user and rack‑scale measurements. (blogs.nvidia.com)
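The multiples quoted above can be sanity-checked with simple arithmetic. The sketch below uses only the figures cited in this brief and is a rough illustration, not a benchmark. It rests on three assumptions: decimal units (1 PB = 1,000 TB), a per-GPU bandwidth obtained by dividing the 64 TB/sec DGX B200 figure evenly across its 8 GPUs, and 72 GPUs in a GB200 NVL72 rack. The variable names are illustrative.

```python
# Figures as quoted in the brief; all values are approximate.
WSE3_SRAM_TB_S = 21_000.0     # ~21 PB/s on-chip SRAM bandwidth (assumes 1 PB = 1,000 TB)
DGX_B200_HBM_TB_S = 64.0      # ~64 TB/s across an 8-GPU DGX B200
GPUS_PER_DGX = 8

# Assumption: the DGX figure splits evenly across its GPUs.
per_gpu_tb_s = DGX_B200_HBM_TB_S / GPUS_PER_DGX

# Bandwidth multiple against one GPU versus against the whole 8-GPU system.
vs_single_gpu = WSE3_SRAM_TB_S / per_gpu_tb_s
vs_full_dgx = WSE3_SRAM_TB_S / DGX_B200_HBM_TB_S

# Measured throughput ratio from the Llama 4 Maverick comparison.
maverick_speedup = 2_500 / 1_000

# Rack-scale aggregate throughput spread over the rack's GPUs.
NVL72_GPUS = 72               # assumption: GPU count of a GB200 NVL72 rack
rack_tokens_per_gpu = 1_500_000 / NVL72_GPUS

print(f"WSE-3 vs one GPU:        {vs_single_gpu:,.0f}x")
print(f"WSE-3 vs 8-GPU DGX B200: {vs_full_dgx:,.0f}x")
print(f"Maverick tokens/sec:     {maverick_speedup:.1f}x")
print(f"NVL72 aggregate per GPU: {rack_tokens_per_gpu:,.0f} tokens/sec")
```

Under these assumptions, the ~2,600x headline corresponds to comparing one wafer against a single GPU (about 2,625x), while the same wafer against the full 8-GPU system is closer to 328x. The measured Maverick throughput gap is about 2.5x, and the rack-scale NVIDIA figure works out to roughly 20,800 aggregate tokens/sec per GPU, which is not comparable to the ~650 tokens/sec Baseten figure because the two measure different things.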