xAI launches Grok Voice API
- xAI rolled out Custom Voices on April 30, letting developers clone a voice from a short recording and use it in Grok TTS and realtime agents.
- The key detail is speed and scope: xAI says a voice can be created in under two minutes, with over 80 built-in voices across 28 languages.
- It matters because xAI just turned Grok’s voice stack into a full platform, but with U.S.-only limits and anti-impersonation gates.

Voice cloning is the new part here: not just “Grok can talk,” but “Grok can sound like you.” That matters because voice AI has been moving fast from novelty demos into customer support, media, accessibility, and phone agents. The gap was that xAI already had realtime voice, text-to-speech, and speech-to-text, but not the obvious next layer: custom identity. On April 30, xAI added that layer with Custom Voices and a Voice Library inside its API console.

### What actually launched?

xAI launched Custom Voices, which lets a user clone a voice from a short recording and then use that cloned voice across the company’s Text-to-Speech API and Voice Agent API. It also launched Voice Library, a console page for managing built-in and custom voices in one place. This extends the voice stack xAI had been building from late 2025 through April 2026.

### How much audio do you need?

Not much, at least in theory. xAI’s product post says you can clone a voice from a few seconds of audio and get a production-ready model in under two minutes. But the docs are more practical: they recommend a reference clip of up to 120 seconds, say clips under 30 seconds may lose detail, and suggest 90 to 120 seconds for the best result. In practice, plan on a minute or two of audio if you want quality.

### What makes the voices feel more real?

The big thing is expressiveness. xAI’s TTS stack already supports inline speech tags for pauses, laughter, whispers, pitch shifts, speed changes, and other delivery controls. So Custom Voices is not just about matching timbre (the sound of a person’s voice) but also about carrying expressive style into generated speech. That is why the demos sound more like a performance.

### Where can developers use it?

Anywhere xAI’s voice endpoints already work. A custom `voice_id` can be passed into the standard TTS endpoint, the TTS WebSocket endpoint, or the realtime Voice Agent API.
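
To make the "one `voice_id`, three surfaces" idea concrete, here is a minimal sketch of how a client might prepare requests that all reuse the same cloned voice. Only the `voice_id` identifier comes from xAI's description; the surface labels, the `text` field, and the payload shape are assumptions for illustration, not xAI's actual request schema, so check the official docs before wiring this to real endpoints.

```python
import json

# Illustrative labels for the three surfaces named in the article.
# These are NOT real xAI endpoint paths (hypothetical names).
SURFACES = ("tts", "tts_websocket", "voice_agent")


def build_payloads(voice_id: str, text: str) -> dict[str, str]:
    """Return one JSON body per surface, all sharing the same voice_id.

    Only `voice_id` is taken from the article; the `text` field and the
    overall payload shape are hypothetical.
    """
    if not voice_id:
        raise ValueError("voice_id must be a non-empty string")
    return {
        surface: json.dumps({"voice_id": voice_id, "text": text})
        for surface in SURFACES
    }


if __name__ == "__main__":
    # "my-cloned-voice" is a placeholder, not a real voice identifier.
    for surface, body in build_payloads("my-cloned-voice", "Hello there.").items():
        print(surface, body)
```

The point of the sketch is that the voice is selected by a single identifier, so switching an app from a built-in voice to a cloned one should be a one-value change rather than a per-endpoint integration.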
That means one cloned voice can narrate audio, speak in live assistants, or front a phone-style agent with tool use and web search. xAI is clearly trying to make voice identity a reusable platform primitive, not a one-off feature.

### What guardrails did xAI add?

This is the most important part. xAI says every custom voice goes through a two-stage verification flow. First, the speaker reads a passphrase aloud so the system can verify intent and live presence through transcription. Then xAI compares speaker embeddings from the passphrase clip and the longer recording to confirm that both come from the same speaker, so users cannot clone a voice from a pre-existing recording or clone someone else’s voice.

### Is it open to everyone?

Not fully. The docs say Custom Voices is currently available only in the United States, except Illinois. They also say users can create up to 30 custom voices for free in the console, while API creation is Enterprise-only. So this is broad enough to matter, but still controlled enough that xAI can meter rollout and risk.

### Why launch this now?

Because xAI has spent the last few months filling in the rest of the stack. It opened the Voice Agent API in December 2025, added standalone STT and TTS on April 17, and released the flagship `grok-voice-think-fast-1.0` model on April 23. Custom Voices is the missing piece that turns those parts into a more complete developer platform.

### Bottom line?

This is less a flashy demo than a platform move. xAI now has realtime voice agents, standalone speech APIs, and voice cloning tied together under one roof. The upside is obvious: more natural assistants, branded voices, and accessibility use cases. The catch is just as obvious: once synthetic voices become cheap, fast, and emotionally convincing, trust becomes the real product.