OpenAI releases three realtime voice models
- OpenAI said on May 7 it added three live audio API models (GPT‑Realtime‑2, GPT‑Realtime‑Translate, and GPT‑Realtime‑Whisper) for developers building voice apps. (openai.com)
- The sharpest detail is the split: one model for reasoning, one for translation from 70+ input languages to 13 outputs, and one for low-latency transcription. (openai.com)
- This matters because OpenAI is turning realtime voice from one general demo layer into a more production-shaped stack with clearer tradeoffs around latency, cost, and control. (developers.openai.com)

Voice AI is getting less like a flashy demo and more like infrastructure. That’s the real news here. OpenAI didn’t just ship a prettier talking model; it split live audio into three separate tools for three different jobs, all inside its API on May 7. The gap it’s trying to close is simple: voice assistants sound impressive until they have to reason through a messy request, translate fast enough to keep up with a person, or transcribe speech without lagging behind. (openai.com)

### Why three models instead of one?

Because “voice” is actually three pretty different problems. One system has to hold a conversation and use tools, another has to act like a live interpreter, and a third just needs to turn speech into text as quickly and accurately as possible. (developers.openai.com) OpenAI now maps those jobs directly to GPT‑Realtime‑2, GPT‑Realtime‑Translate, and GPT‑Realtime‑Whisper.

### What is GPT‑Realtime‑2 for?

This is the main voice-agent model, the one meant to listen, reason, speak back, and call tools while the conversation keeps moving. OpenAI is pitching it as its first voice model with “GPT‑5‑class reasoning,” which is basically shorthand for harder multi-step requests, stronger instruction following, and better tool use than the earlier realtime models. (openai.com) It also supports configurable reasoning effort, which means developers can trade speed for more thinking when a task is harder.

### What changed from the older realtime model?

The older general-availability model, gpt-realtime, was already built for live audio over WebRTC, WebSocket, or SIP. But GPT‑Realtime‑2 stretches the context window from 32,000 to 128,000 tokens and raises max output from 4,096 to 32,000. (openai.com) Output text pricing also moves from $16 to $24 per 1 million tokens, so the upgrade is not just “better”: it’s a more capable, more expensive reasoning layer.

### What does the translation model actually do?

GPT‑Realtime‑Translate gets its own dedicated translation session and endpoint, which is a bigger deal than it sounds. Instead of pretending translation is just another chat turn, OpenAI treats it as continuous live audio in and translated audio plus transcript deltas out. (openai.com) The model supports speech from 70+ input languages into 13 output languages while trying to keep pace with the speaker. That makes it feel less like subtitle generation and more like an interpreter in the loop.

### And the Whisper model?

GPT‑Realtime‑Whisper is the transcription specialist. It streams transcript deltas while someone is still talking, and OpenAI says developers can tune latency: lower delay for earlier partial text, higher delay for better quality. (developers.openai.com) That matters for call centers, note-taking, captions, and any workflow where the app needs text fast but doesn’t need the model to answer back.

### Why does this matter for builders?

Because the architecture is getting clearer. OpenAI is no longer saying, “Use one voice model for everything.” It’s saying: pick the exact live-audio path you need (agent, translator, or transcription engine). That reduces orchestration glue, but it also pushes product teams to make sharper choices about latency, failure handling, and when a model should speak versus stay silent. (openai.com)

### What’s the catch?

Realtime voice still inherits the old hard problems: bad audio, accents, interruptions, tool errors, and the weirdness of spoken conversations that change direction mid-sentence. OpenAI’s own docs basically nudge developers to test with real audio conditions, real languages, and real domain vocabulary before locking in defaults. (developers.openai.com) So the upgrade is real, but the responsibility still sits with the team shipping the experience.

### Bottom line?

This launch is less about three new voices than about a cleaner map of live audio work.
OpenAI is carving realtime speech into distinct production tools — one to think, one to translate, one to transcribe — and that makes voice apps easier to build, but also harder to hand-wave. (openai.com) (developers.openai.com)
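
### What does the split look like on paper?

Here is a rough sketch, not an official client, that turns two details from this piece into something you can run: the job-to-model map and the output-text price change from $16 to $24 per 1 million tokens. The job labels, the function names, and the 250,000-token volume are assumptions made up for illustration, and the model names are the display names used above, not confirmed API identifiers.

```python
# Output-text prices in USD per 1 million tokens, as quoted in the article.
# The keys are labels for this sketch, not confirmed API model identifiers.
OUTPUT_TEXT_USD_PER_MILLION = {
    "gpt-realtime": 16.0,
    "GPT-Realtime-2": 24.0,
}

# Job labels ("agent", "translate", "transcribe") are hypothetical names
# invented for this example; the model names come from the article.
MODEL_FOR_JOB = {
    "agent": "GPT-Realtime-2",
    "translate": "GPT-Realtime-Translate",
    "transcribe": "GPT-Realtime-Whisper",
}


def pick_model(job: str) -> str:
    """Return the model the article associates with a live-audio job."""
    try:
        return MODEL_FOR_JOB[job]
    except KeyError:
        raise ValueError(f"unknown job: {job!r}") from None


def output_text_cost(model: str, output_tokens: int) -> float:
    """Estimate output-text cost in USD for a given number of tokens."""
    return OUTPUT_TEXT_USD_PER_MILLION[model] * output_tokens / 1_000_000


if __name__ == "__main__":
    tokens = 250_000  # arbitrary example volume, not a figure from OpenAI
    old = output_text_cost("gpt-realtime", tokens)
    new = output_text_cost("GPT-Realtime-2", tokens)
    print(pick_model("agent"))                             # GPT-Realtime-2
    print(f"${old:.2f} -> ${new:.2f} ({new / old:.1f}x)")  # $4.00 -> $6.00 (1.5x)
```

The arithmetic only covers output text, since that is the one price pair quoted here; audio and input pricing would need their own figures before this could stand in for a real budget estimate.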