Cerebras Claims Llama Training Beats GPUs
Cerebras is touting its wafer-scale compute as a superior alternative for training large models, claiming that clusters of its WSE-3 engines can train Llama models faster and with lower power than traditional GPU clusters. The company is targeting hyperscalers by emphasizing simpler scaling without the need for model parallelism.
The Cerebras WSE-3 (Wafer-Scale Engine) is a single chip built from an entire 12-inch (300 mm) silicon wafer, containing 4 trillion transistors and 900,000 AI-optimized cores. Its key architectural difference from GPUs is 44 gigabytes of on-chip SRAM providing 21 petabytes per second of memory bandwidth, a design intended to eliminate the latency of fetching model parameters from off-chip HBM.

In recent inference benchmarks, a four-node cluster of Cerebras CS-3 systems running the Llama 3.1 70B model achieved 2,100 tokens per second. Cerebras cites this as 8x to 22x faster than cloud-based eight-way NVIDIA H100 GPU instances on the same task. The company attributes the speed-up to keeping the entire model's weights in on-chip memory, avoiding the memory-bandwidth bottleneck that GPUs face.

Cerebras is gaining commercial traction through strategic partnerships, most notably a multi-year deal to deploy 750 megawatts of its wafer-scale systems for OpenAI's inference workloads. Other key customers and partners include G42, in a $100 million deal for AI supercomputers; Meta, for serving Llama models; and Hugging Face, which gives developers API access to Cerebras's high-speed inference.

The AI accelerator market, which accounted for roughly 20% of the total semiconductor market in 2024, is intensely competitive. While NVIDIA remains the dominant player, companies like Groq and SambaNova are also vying for market share, particularly in the inference space.

A major dynamic is the "build vs. buy" decision faced by hyperscalers like Google, AWS, and Meta, which are both major chip customers and developers of their own custom silicon. That decision is reshaping datacenter infrastructure. Hyperscalers are making multi-hundred-billion-dollar capital expenditure commitments to build out AI-specific data centers, including custom chips.
However, the complexity and time required to build custom silicon from scratch mean many still lease capacity or buy specialized hardware from vendors for specific needs, creating an opening for focused players like Cerebras.

The economics of AI are split between training and inference. Training large models is a massive capital expense, with the cost of training a model like Llama 3.1 estimated at between $92 million and $123 million. While Cerebras targets this with claims of faster training, its recent go-to-market focus has heavily emphasized inference, where low latency and high throughput can reduce the significant operational costs that accumulate over a model's lifetime.
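
For readers who want to sanity-check the figures quoted in this article, the following is a rough back-of-envelope sketch, not Cerebras's or Meta's own methodology. It checks three things: how many 44-gigabyte WSE-3 chips are needed to hold a 70B model's weights, what a purely bandwidth-bound ceiling on single-stream decoding would be for an eight-GPU H100 instance, and what hourly GPU price would reproduce the quoted training-cost range. Several inputs are assumptions not found in the article: 16-bit weights, 3.35 TB/s of HBM bandwidth per H100 (NVIDIA's published peak for the SXM part), roughly 30.84 million GPU-hours for Llama 3.1 405B (the figure in Meta's model card), and hypothetical rental rates of $3 to $4 per GPU-hour.

```python
import math

# Figures quoted in the article.
SRAM_BYTES_PER_WSE3 = 44e9       # 44 GB of on-chip SRAM per WSE-3
PARAMS_70B = 70e9                # parameter count of the 70B model
CEREBRAS_TOKENS_PER_S = 2_100    # reported throughput of the four-node cluster

# ASSUMPTIONS: none of the values below come from the article.
BYTES_PER_PARAM = 2              # 16-bit weights
H100_HBM_BYTES_PER_S = 3.35e12   # peak HBM3 bandwidth of one H100 SXM
GPUS_PER_INSTANCE = 8            # "eight-way" cloud instance
LLAMA_405B_GPU_HOURS = 30.84e6   # Meta model card figure for Llama 3.1 405B
USD_PER_GPU_HOUR = (3.0, 4.0)    # hypothetical cloud rental rates


def weights_bytes(params: float, bytes_per_param: float) -> float:
    """Memory needed for the weights alone (no KV cache, no activations)."""
    return params * bytes_per_param


def systems_needed(model_bytes: float, sram_bytes_per_system: float) -> int:
    """Smallest number of systems whose combined SRAM holds the weights."""
    return math.ceil(model_bytes / sram_bytes_per_system)


def decode_ceiling_tokens_per_s(bandwidth_bytes_per_s: float,
                                model_bytes: float) -> float:
    """Upper bound on single-stream decoding when generating each token
    requires reading every weight from memory once."""
    return bandwidth_bytes_per_s / model_bytes


def training_cost_usd(gpu_hours: float, usd_per_gpu_hour: float) -> float:
    """Rental-style cost estimate: GPU-hours times an hourly rate."""
    return gpu_hours * usd_per_gpu_hour


if __name__ == "__main__":
    model = weights_bytes(PARAMS_70B, BYTES_PER_PARAM)
    print("Weights (GB):", model / 1e9)
    print("WSE-3 systems needed:", systems_needed(model, SRAM_BYTES_PER_WSE3))

    gpu_bw = GPUS_PER_INSTANCE * H100_HBM_BYTES_PER_S
    ceiling = decode_ceiling_tokens_per_s(gpu_bw, model)
    print("8x H100 single-stream ceiling (tokens/s):", round(ceiling))
    print("Reported Cerebras rate / ceiling:",
          round(CEREBRAS_TOKENS_PER_S / ceiling, 1))

    for rate in USD_PER_GPU_HOUR:
        cost = training_cost_usd(LLAMA_405B_GPU_HOURS, rate)
        print(f"Training cost at ${rate:.0f}/GPU-hour: ${cost / 1e6:.1f}M")
```

Under these assumptions, the weights alone need four systems' worth of SRAM, which is consistent with the four-node cluster described above. The eight-GPU ceiling comes out near 190 tokens per second for a single stream, and rates of $3 to $4 per GPU-hour land on roughly $92.5 million to $123.4 million. The sketch ignores batching, quantization, KV-cache traffic, and interconnect overhead, all of which shift real-world numbers.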