Cerebras offers 1M‑token free LLM API

- Cerebras Systems is offering free API access to its inference service on May 14, 2026, letting developers test open-weight large language models without upfront payment. - Cerebras says the free plan includes up to 1 million tokens per day, while its pricing page lists GPT-OSS 120B, Qwen 3 and GLM 4.7. - Cerebras documentation says developers can create an API key through its cloud console, and some listed models are scheduled for deprecation on May 27.

Cerebras Systems is offering free API access to its inference platform, giving developers a way to test several open-weight large language models without paying upfront, according to the company’s pricing page and developer documentation. The company’s site says the free plan includes access to Cerebras-powered models and positions the service as a starting point for prototyping prompts, agents and real-time applications. Cerebras’ inference page says the platform can process more than 3,000 tokens per second on some models. Its quickstart guide says developers can get a free API key and make chat-completion calls through Python, Node.js or direct API requests. ### Which models are included in the free access? Cerebras’ pricing page lists GPT-OSS 120B, Z.ai’s GLM 4.7, Meta’s Llama 3.1 8B and Alibaba’s Qwen 3 235B Instruct among the models available through its inference service. The same page shows posted pay-as-you-go prices for those models, including $0.35 per million input tokens and $0.75 per million output tokens for GPT-OSS 120B, and $2.25 per million input tokens and $2.75 per million output tokens for GLM 4.7. (cerebras.ai) Cerebras separately announced support for Qwen 3 235B Instruct in July 2025 and said at the time that the model was available through its inference cloud with a free tier of 1 million tokens per day. Cerebras also said in January 2026 that GLM-4.7 was available on its platform, and in an earlier post it said GPT-OSS 120B was available on Cerebras Cloud as a launch-partner deployment. (cerebras.ai) ### How much can a developer use before paying? Cerebras has publicly described its free access as a 1 million-token-per-day tier. The company repeated that figure in its Qwen 3 launch post and in other product materials tied to its inference service. Cerebras’ current public pricing page does not spell out a daily request cap in the text surfaced by the site, and Reuters could not independently verify the “10,000 requests per day” figure from Cerebras’ official pages reviewed for this article. (cerebras.ai) Cerebras’ documentation does say rate limits are measured across requests and tokens in minute, hour and day windows, and that any one of those thresholds can trigger limiting. ### How does a developer get started? Cerebras’ quickstart documentation says developers need a Cerebras account and an inference API key, which can be created through the company’s cloud console. The guide says the API can be called through the company’s SDKs or directly through its endpoint, and it provides sample code using the model name “gpt-oss-120b.” The company’s inference materials say the service is OpenAI-compatible, a detail that can reduce migration work for teams already using OpenAI-style chat-completions patterns. (inference-docs.cerebras.ai) Cerebras’ site also says developers can test the models in a playground before making API calls. ### What does Cerebras say about speed and paid upgrades? Cerebras says its inference platform exceeds 3,000 tokens per second on some workloads and describes the service as built for real-time applications such as coding, summarization and autonomous tasks. (inference-docs.cerebras.ai) The company’s pricing page says the next paid developer tier starts at $10 and offers 10 times higher rate limits than the free tier, while enterprise plans add dedicated support and higher throughput. (inference-docs.cerebras.ai) A January 2026 company post on GLM-4.7 said the model ran at about 1,000 tokens per second on Cerebras hardware, while a separate post on GPT-OSS 120B said the model ran at 3,000 tokens per second on Cerebras Cloud. Those performance claims came from Cerebras. ### Are there any limits or changes developers should watch? Cerebras’ supported-models documentation says free-tier rate limits for GLM 4.7 and GPT-OSS 120B have been temporarily reduced because of demand. (cerebras.ai) The same page says Llama 3.1 8B and Qwen 3 235B Instruct will be deprecated on May 27, 2026. May 27, 2026 is the next concrete date on Cerebras’ public model list. Developers using Llama 3.1 8B or Qwen 3 235B Instruct would need to check Cerebras’ supported-models page and migration documentation before that deadline. (cerebras.ai) (inference-docs.cerebras.ai)

Get your own daily briefing

Scout delivers personalized news, insights, and conversations tailored to your role and industry.

Download on the App Store

Shared from Scout - Be the smartest in the room.