Cerebras opens free AI API

- Cerebras launched a free tier for its inference API in May 2026, giving developers access to several open models through its cloud service. - Cerebras says the free plan includes 1 million tokens a day, with model-specific request caps and no paid commitment to start. - On May 27, 2026, Cerebras says Llama 3.1 8B and Qwen 3 235B Instruct will be deprecated.

Cerebras has opened a free tier for its AI inference API, expanding a cloud service it has been pitching on speed and low-friction access for developers. The company’s pricing page says the free plan provides access to “all Cerebras powered models,” while its documentation lists current public-endpoint models including OpenAI GPT OSS, Z.ai’s GLM 4.7, Meta’s Llama 3.1 8B and Qwen 3 235B Instruct. Cerebras’ quickstart guide says developers can generate a free API key and begin making requests through an OpenAI-compatible chat completions format. The company’s public materials do not show a credit-card requirement for the free key flow. ### Which models are actually included in the free API? Cerebras’ model catalog lists four models on its shared public endpoints: `gpt-oss-120b`, `llama3.1-8b`, `qwen-3-235b-a22b-instruct-2507` and `zai-glm-4.7`. The pricing page separately markets those same families on its developer tier, with listed speeds of about 3,000 tokens per second for GPT OSS 120B, 2,200 for Llama 3.1 8B, 1,400 for Qwen 3 235B Instruct and 1,000 for GLM 4.7. (cerebras.ai) The company also flags some churn in that lineup. Cerebras says Llama 3.1 8B and Qwen 3 235B Instruct will be deprecated on May 27, 2026, while GLM 4.7 and GPT OSS 120B have seen temporary free-tier rate-limit reductions because of demand. ### How much can a developer use before hitting the cap? (inference-docs.cerebras.ai) Cerebras’ rate-limit page says the free tier is governed by both request counts and token counts, with whichever limit is reached first stopping additional calls. For GPT OSS 120B, Llama 3.1 8B and Qwen 3 235B Instruct, the page lists 1 million tokens per hour and 1 million tokens per day, alongside 30 requests per minute, 900 per hour and 14,400 per day. (inference-docs.cerebras.ai) For GLM 4.7, the page lists the same 1 million token hourly and daily caps, but lower request caps of 10 per minute, 100 per hour and 100 per day. Those figures are more specific than the rough “about 10,000 daily requests” language circulating around the launch. On Cerebras’ own documentation, the free tier is model-dependent: three models carry a 14,400 daily request ceiling, while GLM 4.7 is capped at 100 daily requests. ### What does Cerebras say makes this API different? (inference-docs.cerebras.ai) Cerebras has centered its pitch on inference speed. Its inference page says the service processes responses at more than 3,000 tokens per second, and its pricing and model pages attach model-by-model speed claims to the public endpoints. Earlier company launch materials said the API used an OpenAI Chat Completions format so developers could switch providers with minimal code changes. (inference-docs.cerebras.ai) In separate product posts, Cerebras has tied that speed argument to coding and agentic workflows. A company post on Qwen3 said every developer gets 1 million tokens per day on the free tier, while another post on GPT OSS 120B said the model was available free in Cerebras Cloud. ### How does a developer actually start using it? (cerebras.ai) Cerebras’ quickstart guide says a user needs a Cerebras account, an API key and either Python or TypeScript tooling. The guide points developers to the company’s cloud console to create the key, then shows a first request using the `cerebras_cloud_sdk` package and the `client.chat.completions.create` method. The default API base URL in the docs is ` and the example request uses `gpt-oss-120b` as the model. (cerebras.ai) ### What changes next on the service? May 27, 2026 is the next dated change on Cerebras’ public docs. That is when the company says `llama3.1-8b` and `qwen-3-235b-a22b-instruct-2507` will be deprecated from the public endpoint lineup, leaving GPT OSS 120B and GLM 4.7 among the named models currently listed for shared access. (inference-docs.cerebras.ai 1) (inference-docs.cerebras.ai 2)

Get your own daily briefing

Scout delivers personalized news, insights, and conversations tailored to your role and industry.

Download on the App Store

Shared from Scout - Be the smartest in the room.