Meta and Cerebras Launch High-Speed Llama API

Meta has partnered with Cerebras Systems to launch the Llama API, a service designed to deliver high-speed AI inference. The collaboration signals a growing interest in alternative silicon for AI workloads beyond the current market leaders. This move also reflects the increasing use of open-source and custom models for deployment in various applications, including at the edge.

- Cerebras's hardware is built on a "Wafer-Scale Engine" (WSE), a single chip that encompasses an entire silicon wafer. The latest WSE-3 integrates 4 trillion transistors and 900,000 AI-optimized cores, designed to minimize latency by keeping memory and compute physically close.
- According to independent benchmarks, Cerebras's solution for Llama 4 can reach speeds of over 2,600 tokens per second, significantly faster than the roughly 130 tokens per second for ChatGPT. For the larger Llama 4 "Maverick" model with 400 billion parameters, Cerebras achieved 2,522 tokens per second, more than double the performance of NVIDIA's Blackwell GPUs.
- The collaboration provides developers with an OpenAI-class alternative for building real-time, intelligent systems through the Llama API. This is aimed at applications that require chaining multiple large language model calls, such as interactive code generation and real-time agents, reducing execution times from minutes to seconds.
- Meta's strategy to open-source its Llama models is designed to accelerate innovation and encourage widespread adoption, creating a community that can build upon and improve the models. This approach aims to establish Llama as a standard in the AI industry, similar to how Linux became a standard for operating systems.
- Cerebras is also engaged in building out a network of AI supercomputers called Condor Galaxy in partnership with the UAE-based G42. The network, which will feature Cerebras's CS-3 systems, is planned to deliver a total of 16 exaFLOPs of AI computing power.
- The Wafer-Scale Engine architecture contrasts with traditional GPU clusters by focusing on data parallelism, where each core on the wafer processes a slice of data simultaneously. This design avoids communication bottlenecks that can occur between multiple individual chips.
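
To see why throughput matters for chained model calls, here is a rough back-of-the-envelope sketch in Python. The workload (10 sequential calls of 1,000 output tokens each) is a hypothetical chosen for illustration, not a figure from the benchmarks, and the calculation ignores network latency and prompt processing. Only the two tokens-per-second rates come from the figures quoted above.

```python
def chain_seconds(num_calls: int, tokens_per_call: int, tokens_per_second: float) -> float:
    """Generation time for sequential calls, ignoring network and prompt-processing overhead."""
    return num_calls * tokens_per_call / tokens_per_second

# Hypothetical workload (an assumption, not from the article):
# an agent that makes 10 model calls in sequence, each producing 1,000 tokens.
NUM_CALLS = 10
TOKENS_PER_CALL = 1_000

slow = chain_seconds(NUM_CALLS, TOKENS_PER_CALL, 130)    # rate quoted for ChatGPT
fast = chain_seconds(NUM_CALLS, TOKENS_PER_CALL, 2_600)  # rate quoted for Cerebras

print(f"{slow:.1f} s at 130 tokens/s")
print(f"{fast:.1f} s at 2,600 tokens/s")
print(f"speedup: {slow / fast:.0f}x")
```

Under these assumptions the chain takes a little over a minute at the slower rate and under four seconds at the faster one, which is consistent with the "minutes to seconds" framing for multi-step workloads.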
