Grok adds Speech API

- xAI launched Grok’s Speech-to-Text API, which offers real-time, multi-speaker transcription.
- The API covers 25 languages and is priced competitively for developers.
- The service targets ML apps and voice tools that need reliable multi-speaker transcription (x.com).

xAI has added a standalone Speech to Text application programming interface, giving developers a new way to turn live or recorded audio into text through Grok. (x.ai)

xAI announced the release on April 17, 2026, alongside a separate Text to Speech product. The new Speech to Text endpoint is available at `/v1/stt` and supports both batch uploads over REST and real-time streaming over WebSocket. (x.ai, docs.x.ai)

The company says the transcription service works in 25 languages and includes word-level timestamps, speaker diarization, and multichannel transcription. Speaker diarization is the feature that labels who spoke when in a recording with more than one voice. (docs.x.ai, docs.x.ai)

xAI priced the service at $0.10 per hour for batch transcription and $0.20 per hour for streaming. Its documentation lists 600 requests per minute for both modes, 10 requests per second, and 100 concurrent streaming sessions per team in the us-east-1 region. (x.ai, docs.x.ai)

Speech-to-text systems sit underneath call-center software, meeting note tools, captioning products, and voice agents. xAI is now selling that layer directly instead of limiting it to Grok’s own consumer voice features. (x.ai, docs.x.ai)

The launch also extends xAI’s push into developer infrastructure. Its docs now group Speech to Text with the company’s Voice Agent application programming interface and Text to Speech service, which together cover listening, speaking, and live voice interaction. (docs.x.ai, x.ai)

xAI says the transcription model is built on the same voice stack used in Grok Voice, Tesla vehicles, and Starlink customer support. In its launch post, the company compared its hourly rates with AssemblyAI, ElevenLabs, and Deepgram and presented internal word error rate benchmarks across phone calls, meetings, podcasts, and telephony audio. (x.ai)

The product page also highlights formatting that rewrites spoken numbers, dates, and currencies into standard written form. In the streaming documentation, xAI says developers can force a language code such as `en`, `fr`, `de`, or `ja`, and that setting enables this normalization behavior. (x.ai, docs.x.ai)

For developers building note takers, call logs, or voice assistants, the update means Grok now has a dedicated transcription endpoint to go with its existing voice and model APIs. (docs.x.ai, x.ai)
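
To make the quoted pricing concrete, here is a minimal sketch of a cost estimator built only from the hourly rates reported above ($0.10 for batch, $0.20 for streaming). The function name `estimate_cost` and the assumption that billing is pro-rated linearly by audio duration are illustrative and not taken from xAI's documentation; actual billing increments or minimums may differ.

```python
from decimal import Decimal

# Hourly rates in USD per hour of audio, as quoted in the article.
RATES_PER_HOUR = {
    "batch": Decimal("0.10"),
    "streaming": Decimal("0.20"),
}


def estimate_cost(audio_seconds: float, mode: str = "batch") -> Decimal:
    """Estimate transcription cost in USD.

    Assumption (not confirmed by the source): cost scales linearly with
    audio duration, with no minimum charge or rounding to billing blocks.
    """
    if mode not in RATES_PER_HOUR:
        raise ValueError(f"unknown mode: {mode!r}")
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    hours = Decimal(str(audio_seconds)) / Decimal(3600)
    # Round to four decimal places for display.
    return (RATES_PER_HOUR[mode] * hours).quantize(Decimal("0.0001"))


if __name__ == "__main__":
    # A 90-minute recording transcribed in batch mode.
    print(estimate_cost(90 * 60, "batch"))  # prints 0.1500
```

Under these assumptions, a 90-minute recording would cost about $0.15 in batch mode and twice that when streamed.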
