xAI launches Grok voice API
- xAI rolled out Custom Voices on April 30, letting developers clone a voice from about a minute of speech for Grok’s TTS and realtime voice agents.
- The telling detail is speed and scope: xAI says cloning takes under two minutes, and the built-in catalog now covers 80-plus voices across 28 languages.
- It matters because voice is shifting from demo feature to deployable stack. With MCP tools, Grok can now talk and act.

Voice APIs are turning into full application stacks. Not just “make this text sound nice,” but listen, speak, call tools, and pull data from other systems while a conversation is still happening. That has been the missing piece for a lot of companies trying to build phone agents or voice interfaces that do real work. xAI’s latest move is to close that gap from both sides at once, with custom voice cloning on one side and live tool access on the other. (x.ai)

### What actually launched?

The new piece is called Custom Voices. xAI announced it on April 30, 2026, and the pitch is simple: record about a minute of speech, wait under two minutes, and you get a production-ready cloned voice you can use across Grok Text to Speech and the Voice Agent API. Alongside that, xAI added a Voice Library inside its console so teams can manage cloned and built-in voices in one place. (x.ai)

### Is this a separate product or part of Grok’s voice stack?

It sits inside a broader voice platform xAI has been building in public over the last few weeks. On April 17, xAI launched standalone speech-to-text and text-to-speech endpoints. On April 23, it launched grok-voice-think-fast-1.0, a realtime voice model aimed at support, sales, and other multi-step workflows. Custom Voices plugs into that stack rather than standing alone. (x.ai)

### How does the cloning part work?

The basic flow is pretty lightweight. You upload a reference clip up to 120 seconds long, and xAI says 90 to 120 seconds is best. The resulting `voice_id` works like any built-in voice ID in the TTS and realtime voice APIs. In practice, that means a developer can swap from a stock voice to a branded or personal one without changing the rest of the app architecture. (docs.x.ai)

### What’s the safety catch?

xAI is trying to make “clone any voice” less reckless than it sounds.
The system uses a two-step verification flow: first a spoken passphrase that gets transcribed in real time, then a speaker-embedding check that compares the passphrase clip with the longer recording. xAI’s claim is that this blocks cloning from a pre-existing recording. (docs.x.ai) Custom Voices is also currently available only in the United States, except Illinois. (x.ai)

### Why do the connectors matter here?

Because a voice agent without tools is still mostly a demo. xAI’s Remote MCP tools let Grok connect to outside systems through Model Context Protocol servers, and that support extends to the Voice Agent API too. So the same agent that speaks in a custom voice can also look things up, trigger actions, or query company systems during a call. (x.ai) (docs.x.ai)

### What can developers do with it now?

xAI’s own examples point at customer support, appointment booking, sales, narration, audiobooks, and accessibility use cases. The built-in voice catalog has expanded to more than 80 voices across 28 languages, while the realtime stack supports low-latency turn-taking over WebSockets and browser-friendly ephemeral client secrets. (docs.x.ai) So the stack is aimed at conversations in web apps, phone flows, and embedded products. (x.ai)

### Is the real story voice quality or workflow integration?

Both, but the workflow side is the bigger shift. Plenty of companies can now synthesize a convincing voice. The harder part is making that voice useful in production: secure auth, low latency, tool use, and access to business data. xAI is clearly trying to bundle all of that into one developer path, with cloning as the attention-grabbing part. That last point is an inference, but it fits the way these launches were staged across April. (x.ai)

### Bottom line?

The news is not just that xAI can clone a voice. It’s that Grok’s voice stack is turning into a deployable agent platform, one that can sound like a person, respond in real time, and reach into outside tools while it talks.
That is a much bigger product than a voice demo. (x.ai)
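
For readers who want the cloning constraints and the verification idea from the sections above in concrete form, here is a minimal Python sketch. It is not xAI's API or implementation. The function names, the 0.75 similarity threshold, and the toy embedding vectors are all assumptions made for illustration; only the 120-second cap and the 90-to-120-second recommendation come from the article. The sketch shows a pre-flight check on reference-clip length and a cosine-similarity comparison of the kind a speaker-embedding check might use.

```python
import math

MAX_CLIP_SECONDS = 120.0         # upper limit cited in the article
RECOMMENDED_MIN_SECONDS = 90.0   # lower end of the range xAI reportedly recommends


def check_reference_clip(duration_seconds: float) -> str:
    """Hypothetical pre-flight check: classify a reference clip by its length."""
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    if duration_seconds > MAX_CLIP_SECONDS:
        return "too_long"
    if duration_seconds >= RECOMMENDED_MIN_SECONDS:
        return "recommended"
    return "short"


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b) or not a:
        raise ValueError("vectors must be non-empty and the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)


def same_speaker(passphrase_embedding: list[float],
                 recording_embedding: list[float],
                 threshold: float = 0.75) -> bool:
    """Assumed decision rule: accept when the two embeddings are similar enough.

    The threshold is an illustrative guess, not a value published by xAI.
    """
    return cosine_similarity(passphrase_embedding, recording_embedding) >= threshold


if __name__ == "__main__":
    print(check_reference_clip(105))                # clip inside the recommended range
    print(same_speaker([1.0, 0.0], [0.9, 0.1]))     # toy vectors pointing the same way
    print(same_speaker([1.0, 0.0], [0.0, 1.0]))     # toy vectors that are orthogonal
```

A real system would get its embeddings from a trained speaker-recognition model and tune the threshold against false accepts and false rejects. The toy two-dimensional vectors here only demonstrate the comparison step.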