OpenAI launches realtime voice trio
- OpenAI said on May 7 it put three new realtime voice models into its API: GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper.
- The sharpest detail is Zillow’s early test: its hardest call benchmark rose to 95% success from 69% after prompt tuning.
- This pushes voice AI toward full production systems, not demos, and raises the bar for rivals on latency and integration.

Voice AI has had a demo problem for years. It could sound fluid for 20 seconds, then fall apart when a caller interrupted, switched languages, or asked for something that needed memory and tool use. OpenAI is trying to close that gap. On May 7, it released three new realtime audio models in its API: one for conversational voice agents, one for live translation, and one for streaming transcription.

### What actually launched?

The lineup is pretty clean. GPT-Realtime-2 is the voice agent model, the one meant to carry a live conversation, reason through harder requests, and call tools while speaking. GPT-Realtime-Translate is a separate model for interpreter-style sessions. GPT-Realtime-Whisper handles low-latency speech-to-text. OpenAI split these jobs instead of forcing one model to do everything, which matters because translation, transcription, and conversation have different latency and control needs. (openai.com)

### Why split them into three?

Because “talking” is really three different products hiding under one interface. A voice agent has to remember context and decide what to do next. A translator should stay out of the way and just convert speech between languages. A transcription system should optimize for speed, partial text, and stable output. OpenAI’s docs even put them on different paths: standard realtime sessions for GPT-Realtime-2, a dedicated translations endpoint for GPT-Realtime-Translate, and separate latency controls for GPT-Realtime-Whisper. (openai.com)

### What is the flagship model good at?

GPT-Realtime-2 is the main story. OpenAI describes it as bringing “GPT-5-class reasoning” into live voice, with configurable reasoning effort so developers can trade speed for better handling of complex requests. That sounds abstract, but the practical point is simple: the model is supposed to stay coherent while a user interrupts, changes direction, or asks for something that requires tools and memory instead of canned responses.
(developers.openai.com)

### How broad is the translation push?

Pretty broad on input, narrower on output. GPT-Realtime-Translate supports speech from more than 70 input languages into 13 output languages, and OpenAI prices it by audio minute instead of text tokens. That tells you how the company wants this used: call centers, travel assistants, meetings, and any app where people are just talking continuously instead of sending neat text prompts. (openai.com)

### Is there any real-world proof yet?

A little, but it’s early, and the headline example comes from an interested party. Zillow, one of the early testers, said its hardest adversarial benchmark improved from 69% to 95% call success after prompt tuning with GPT-Realtime-2. That is a huge jump if it holds up in production. But it is still a vendor-plus-customer success story, not an independent bake-off. (developers.openai.com)

### Why does this matter beyond one launch?

Because the competition in voice AI is no longer just about model quality. It is about the whole loop: audio in, reasoning, tool use, translation or transcription, then audio back out with low enough latency that a human doesn’t get annoyed. OpenAI already had realtime infrastructure, but this launch makes the product map much more explicit. Developers can now choose a model built for the exact voice job they want instead of stitching together a stack themselves. (aihola.com)

### What’s the catch?

The catch is that better reasoning usually costs either time or money. OpenAI’s own model page says higher reasoning effort can increase latency and output token usage for GPT-Realtime-2. So the hard product question does not disappear: how smart can a voice agent get before the pause becomes awkward or the bill becomes painful? (platform.openai.com)

### Bottom line

This launch is really about turning voice from a flashy interface into a dependable application layer.
If GPT-Realtime-2 can keep conversations on track while the translation and transcription models handle the simpler jobs cleanly, OpenAI gets something more valuable than a good demo: a default stack for building voice software. (openai.com) (developers.openai.com)
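
For developers, the three-way split described above amounts to a routing decision made before any audio is sent. The sketch below is one minimal way to write that decision down in Python. It makes no API calls, and everything beyond the three model names is an assumption for illustration: `VoiceJob`, `ModelChoice`, `ROUTES`, and `pick_model` are hypothetical names, the model strings are the display names used in this article rather than verified API identifiers, and the path and note fields only paraphrase what the article reports.

```python
from dataclasses import dataclass
from enum import Enum


class VoiceJob(Enum):
    """The three jobs the article says OpenAI split apart (hypothetical enum)."""
    CONVERSATION = "conversation"
    TRANSLATION = "translation"
    TRANSCRIPTION = "transcription"


@dataclass(frozen=True)
class ModelChoice:
    model: str         # display name from the article, NOT a verified API id
    session_path: str  # paraphrase of how the article says the docs route it
    note: str          # cost or latency detail, only as stated in the article


# Hypothetical lookup table; contents paraphrase the article, nothing more.
ROUTES = {
    VoiceJob.CONVERSATION: ModelChoice(
        "GPT-Realtime-2",
        "standard realtime session",
        "higher reasoning effort can raise latency and output token usage",
    ),
    VoiceJob.TRANSLATION: ModelChoice(
        "GPT-Realtime-Translate",
        "dedicated translations endpoint",
        "priced per audio minute",
    ),
    VoiceJob.TRANSCRIPTION: ModelChoice(
        "GPT-Realtime-Whisper",
        "session with separate latency controls",
        "pricing not stated in the article",
    ),
}


def pick_model(job: VoiceJob) -> ModelChoice:
    """Return the model the article associates with a given voice job."""
    return ROUTES[job]


if __name__ == "__main__":
    for job in VoiceJob:
        choice = pick_model(job)
        print(f"{job.value}: {choice.model} via {choice.session_path} ({choice.note})")
```

Keeping the choice in a plain lookup table reflects the article's point: the job, not the prompt, decides which model and session type an app should use. Real identifiers and endpoints should be taken from OpenAI's documentation.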