Cerebras CEO: big inference claim

Cerebras CEO Andrew Feldman tweeted that NVIDIA would need more than 2,000 chips to run a 2T-parameter model at 1,000 tokens/sec, while Cerebras can do it today with 20 wafers at 'thousands' of tokens/sec. The claim is a direct head-to-head comparison of inference efficiency, and the post is positioned as competitive messaging ahead of broader cloud inference plays. (x.com)

Cerebras publicized a 2,000 tokens-per-second inference run of the open-source K2 Think model on its Cerebras Inference platform in a Sept. 10, 2025 press release with MBZUAI and G42. (cerebras.ai) The company also published an Artificial Analysis benchmark claiming 2,522 output tokens/sec on Meta’s 400B-parameter Llama 4 Maverick model. (businesswire.com)

Cerebras says its WSE-3 wafer-scale chip, which powers the CS-3 systems, packs roughly 4 trillion transistors, 900,000 AI-optimized cores and 44 GB of on-chip SRAM, and delivers 125 PFLOPS of peak AI performance. (cerebras.ai) In company-published comparisons, Cerebras claims its CS-3 inference service outperforms NVIDIA’s DGX B200 Blackwell system by wide multiples, citing a 21× inference speed advantage and lower cost and power per request. (cerebras.ai)

AWS and Cerebras announced a multi-month collaboration to make Cerebras hardware available via Amazon Bedrock, pairing AWS Trainium servers with Cerebras CS-3 nodes and Elastic Fabric Adapter networking in a disaggregated inference architecture. (aboutamazon.com) Cerebras closed a $1.1 billion Series G in September 2025 and has outlined plans for new data centers and thousands of CS-3 systems to scale high-speed inference capacity; the company and its partners have cited aggregate targets of multiple millions of tokens per second. (cerebras.ai)

Public demos and third-party writeups report real-world uses of Cerebras infrastructure: an OpenAI demo and related coverage show a Cerebras-backed “Spark” variant running at ~1,000 tokens/sec in at least one coding-model scenario. (servethehome.com)
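For a rough sense of scale, the figures quoted above can be combined in a few lines of arithmetic. This is only a back-of-the-envelope sketch: it assumes the 20 wafers in the claim are WSE-3 parts with the per-chip specs Cerebras publishes (the tweet as summarized here does not say so), it takes "2,000+" at its lower bound, and it says nothing about whether either configuration actually reaches the stated throughput. All variable and function names are illustrative.

```python
# Figures quoted in the briefing; nothing here is measured.
NVIDIA_CHIPS_CLAIMED = 2_000     # "2,000+" chips, taken at the lower bound
CEREBRAS_WAFERS_CLAIMED = 20     # wafers in the same claim
# Assumption: the wafers are WSE-3 parts with Cerebras' published specs.
SRAM_GB_PER_WAFER = 44
PEAK_PFLOPS_PER_WAFER = 125
CORES_PER_WAFER = 900_000


def aggregate(wafers: int) -> dict:
    """Scale the per-wafer specs linearly to a given wafer count."""
    return {
        "sram_gb": wafers * SRAM_GB_PER_WAFER,
        "peak_pflops": wafers * PEAK_PFLOPS_PER_WAFER,
        "cores": wafers * CORES_PER_WAFER,
    }


totals = aggregate(CEREBRAS_WAFERS_CLAIMED)
device_ratio = NVIDIA_CHIPS_CLAIMED / CEREBRAS_WAFERS_CLAIMED

print(f"Device-count ratio (chips per wafer): {device_ratio:.0f}x")
print(f"Aggregate on-chip SRAM: {totals['sram_gb']} GB")
print(f"Aggregate peak compute: {totals['peak_pflops']} PFLOPS")
print(f"Aggregate cores: {totals['cores']:,}")
```

Under these assumptions the claim works out to a 100x difference in device count, with the 20 wafers totaling 880 GB of on-chip SRAM, 2,500 PFLOPS of peak compute and 18 million cores. A wafer and a GPU are very different units, so the device ratio is a framing of the claim rather than a like-for-like efficiency measure.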

