Meta and Cerebras Launch High-Speed Llama API
Meta has partnered with Cerebras Systems to launch the Llama API, a service designed to deliver high-speed AI inference. The collaboration signals growing interest in alternative silicon for AI workloads beyond the current market leaders. It also reflects the increasing deployment of open-source and custom models across a range of applications, including at the edge.
- Cerebras's hardware is built on a "Wafer-Scale Engine" (WSE), a single chip that encompasses an entire silicon wafer. The latest WSE-3 integrates 4 trillion transistors and 900,000 AI-optimized cores, designed to minimize latency by keeping memory and compute physically close.
- According to independent benchmarks, Cerebras's solution for Llama 4 can reach speeds of over 2,600 tokens per second, significantly faster than the roughly 130 tokens per second for ChatGPT. For the larger Llama 4 "Maverick" model with 400 billion parameters, Cerebras achieved 2,522 tokens per second, more than double the performance of NVIDIA's Blackwell GPUs.
- The collaboration provides developers with an OpenAI-class alternative for building real-time, intelligent systems through the Llama API. This is aimed at applications that require chaining multiple large language model calls, such as interactive code generation and real-time agents, reducing execution times from minutes to seconds.
- Meta's strategy to open-source its Llama models is designed to accelerate innovation and encourage widespread adoption, creating a community that can build upon and improve the models. This approach aims to establish Llama as a standard in the AI industry, similar to how Linux became a standard for operating systems.
- Cerebras is also engaged in building out a network of AI supercomputers called Condor Galaxy in partnership with the UAE-based G42. The network, which will feature Cerebras's CS-3 systems, is planned to deliver a total of 16 exaFLOPs of AI computing power.
- The Wafer-Scale Engine architecture contrasts with traditional GPU clusters by focusing on data parallelism, where each core on the wafer processes a slice of data simultaneously. This design avoids communication bottlenecks that can occur between multiple individual chips.
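
The "minutes to seconds" claim for chained model calls can be checked with rough arithmetic. The sketch below is a back-of-the-envelope estimate, not a benchmark: the helper `chain_seconds` and the workload (10 sequential calls of 1,000 output tokens each) are hypothetical, and it counts only generation time, ignoring network latency and prompt processing. The two throughput figures are the ones quoted above.

```python
def chain_seconds(calls: int, tokens_per_call: int, tokens_per_second: float) -> float:
    """Estimate total generation time for sequential LLM calls.

    Assumes each call must finish before the next starts and ignores
    network latency and prompt-processing time.
    """
    return calls * tokens_per_call / tokens_per_second


# Hypothetical workload, chosen only for illustration.
CALLS = 10
TOKENS_PER_CALL = 1000

slow = chain_seconds(CALLS, TOKENS_PER_CALL, 130)    # ~130 tok/s figure quoted above
fast = chain_seconds(CALLS, TOKENS_PER_CALL, 2600)   # ~2,600 tok/s figure quoted above

print(f"{slow:.1f} s at 130 tokens/s")
print(f"{fast:.1f} s at 2,600 tokens/s")
print(f"speedup: {slow / fast:.0f}x")
```

Under these assumptions the chain drops from over a minute to a few seconds. Real pipelines add per-call overhead that does not scale with throughput, so the end-to-end gain would likely be smaller.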