Grok Speech‑to‑Text API

- xAI launched Grok's Speech‑to‑Text API offering instant multi‑speaker transcription across 25 languages.
- The service targets embedding low‑latency transcription into apps with multi‑speaker diarization and competitive pricing.
- This provides platforms a ready option to add real‑time speech features without building and maintaining speech models. (x.com)

Speech-to-text is software that turns spoken audio into written text, like live captions for a call or a podcast. xAI said on April 17 it is now selling that capability as a standalone Grok application programming interface, or API. (x.ai)

The new Grok Speech to Text API supports batch uploads over a REST interface and real-time transcription over WebSocket streaming. xAI said the service can identify multiple speakers, add word-level timestamps, and handle multichannel audio in the same API. (x.ai)

xAI’s developer docs list support for 25 languages and for interim results while streaming, and price the service at $0.10 per hour for batch transcription and $0.20 per hour for streaming. The docs also list rate limits of 600 requests per minute for both batch and streaming, with up to 100 concurrent streaming sessions per team. (docs.x.ai)

The company is pitching the product to developers who want voice features inside existing software rather than a full voice bot. xAI’s launch post names transcription tools, accessibility products, podcasts, and interactive audio apps as target uses. (x.ai)

That matters because speech products usually need two layers at once: recognition that hears words and formatting that turns spoken phrases into usable text. xAI said its system includes inverse text normalization, which converts spoken numbers, dates, currencies, and similar phrases into standard written forms. (x.ai)

The launch also extends xAI’s push beyond chatbots into developer infrastructure. Its docs now list three separate voice products: a speech-to-text endpoint at `/v1/stt`, a text-to-speech endpoint at `/v1/tts`, and a real-time voice agent endpoint at `/v1/realtime`. (docs.x.ai)

xAI is tying the new audio tools to systems it already runs at consumer scale. The company said the same stack powers Grok Voice, Tesla vehicles, and Starlink customer support, echoing a similar claim it made when it opened the Grok Voice Agent API in December 2025. (x.ai 1) (x.ai 2)

In its launch materials, xAI also published benchmark tables comparing Grok Speech to Text with ElevenLabs, Deepgram, and AssemblyAI across phone calls, meetings, telephone audio, and video or podcast clips. Those figures come from xAI, not an independent lab, and the company said Grok posted the lowest overall word error rate in its tests. (x.ai)

For developers, the immediate change is practical: xAI now offers a ready-made way to plug live transcription into apps without training or hosting a speech model themselves. The company’s API site says the broader Grok platform is aimed at text, voice, image generation, and real-time search from the same console. (x.ai)
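
For a team sizing up the service, the prices and limits quoted from xAI's docs can be turned into a rough budget check. The sketch below is only an illustration. It assumes billing is strictly linear per hour of audio, with no minimums, rounding rules, or volume discounts, none of which the article confirms. The `Workload` class, its field names, and both helper functions are hypothetical and not part of any xAI SDK.

```python
from dataclasses import dataclass

# Figures as quoted from xAI's developer docs in the article.
BATCH_USD_PER_HOUR = 0.10
STREAMING_USD_PER_HOUR = 0.20
MAX_REQUESTS_PER_MINUTE = 600   # listed for both batch and streaming
MAX_CONCURRENT_STREAMS = 100    # per team


@dataclass
class Workload:
    """Hypothetical planned usage; field names are illustrative only."""
    batch_audio_hours: float
    streaming_audio_hours: float
    peak_requests_per_minute: int
    peak_concurrent_streams: int


def estimate_cost_usd(w: Workload) -> float:
    """Estimate the bill, ASSUMING simple linear per-hour pricing."""
    cost = (w.batch_audio_hours * BATCH_USD_PER_HOUR
            + w.streaming_audio_hours * STREAMING_USD_PER_HOUR)
    return round(cost, 2)


def limit_violations(w: Workload) -> list[str]:
    """Return a message for each quoted limit the workload would exceed."""
    problems = []
    if w.peak_requests_per_minute > MAX_REQUESTS_PER_MINUTE:
        problems.append(
            f"{w.peak_requests_per_minute} requests/min exceeds "
            f"{MAX_REQUESTS_PER_MINUTE}")
    if w.peak_concurrent_streams > MAX_CONCURRENT_STREAMS:
        problems.append(
            f"{w.peak_concurrent_streams} concurrent streams exceeds "
            f"{MAX_CONCURRENT_STREAMS}")
    return problems


if __name__ == "__main__":
    plan = Workload(batch_audio_hours=1000, streaming_audio_hours=250,
                    peak_requests_per_minute=400, peak_concurrent_streams=120)
    print("Estimated cost (USD):", estimate_cost_usd(plan))
    print("Limit problems:", limit_violations(plan))
```

Under those assumptions, 1,000 hours of batch audio plus 250 hours of streaming comes to $150, and a peak of 120 simultaneous streams would exceed the 100-session limit listed for a single team.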
