OpenAI launches GPT‑Realtime‑2 voice stack
- OpenAI rolled out three new Realtime API voice models on May 7: GPT‑Realtime‑2, GPT‑Realtime‑Translate, and GPT‑Realtime‑Whisper for live agents.
- The standout detail is translation scale (70+ spoken input languages into 13 output languages), plus configurable reasoning that trades speed for depth.
- This shifts voice from chatbot demo territory toward production agents that can listen, think, transcribe, translate, and use tools mid-conversation.

Voice AI has had a pretty obvious problem. It could sound fast, or it could sound smart, but usually not both. The awkward pause before an answer was the giveaway.

So this launch matters because OpenAI is trying to collapse that tradeoff inside one live audio stack: not just a prettier voice, but models built to reason, translate, and transcribe while someone is still talking. OpenAI said on May 7 it is adding GPT‑Realtime‑2, GPT‑Realtime‑Translate, and GPT‑Realtime‑Whisper to its Realtime API.

### What actually shipped?

Three separate models. GPT‑Realtime‑2 is the main voice agent model. GPT‑Realtime‑Translate handles live speech-to-speech translation. GPT‑Realtime‑Whisper handles streaming speech-to-text. OpenAI is basically splitting one messy “do everything” voice pipeline into cleaner parts that developers can mix depending on the job. (openai.com)

### Why split the stack up?

Because live voice apps have conflicting needs. A customer-support bot may need reasoning and tool use. A meeting app may mostly need fast transcription. A translation app cares about keeping pace with the speaker. When one model tries to do all of that, latency and complexity pile up. OpenAI’s pitch is that separate realtime models cut orchestration overhead and let builders choose the right tradeoff.

### What is new about GPT‑Realtime‑2?

This is the first OpenAI voice model the company describes as having GPT‑5‑class reasoning. More importantly, developers can tune “reasoning effort,” which means the model can spend more compute on harder requests at the cost of more latency and output tokens. That is the core bet here: some voice tasks need instant replies, but others need a model that can actually think for a beat before acting. (venturebeat.com)

### Why does that matter for voice agents?

Because the hard part of voice is not speech anymore. It is decision-making under time pressure.
A live agent has to keep track of context, decide whether to call a tool, maybe ask a follow-up, and not lose the thread while audio keeps arriving. OpenAI is framing GPT‑Realtime‑2 as a model that can carry the conversation forward naturally while still doing those backend tasks in real time. (openai.com)

### What about translation?

GPT‑Realtime‑Translate is the most concrete part of the launch. It supports speech from 70+ input languages into 13 output languages, and OpenAI says it is designed to keep pace with the speaker rather than waiting for full turns. That makes it closer to an interpreter than a batch translation tool: less polished, maybe, but much more usable in an actual live conversation. (openai.com)

### And Whisper?

GPT‑Realtime‑Whisper is for streaming transcription, not voice replies. The useful detail is controllable latency: lower delay for earlier partial text, higher delay for better transcript quality. That sounds small, but it matters for captions, call monitoring, and note-taking apps where the timing of text is part of the product.

### How is it priced?

The pricing model tells you how OpenAI expects these tools to be used. (openai.com) GPT‑Realtime‑2 is billed by token consumption. GPT‑Realtime‑Translate and GPT‑Realtime‑Whisper are billed by audio duration, with GPT‑Realtime‑Translate listed at $0.034 per minute. In other words, reasoning is treated like model work; translation and transcription are treated more like streaming utilities. (developers.openai.com)

### So what changed, really?

The big change is architectural. OpenAI is no longer treating realtime voice as one flashy demo model. It is turning it into a developer stack: one model for smart conversation, one for live translation, one for live transcription. If that works in production, voice agents stop being “talk to a bot” novelties and start looking more like software that happens to speak. (openai.com) (developers.openai.com)
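
To put the per-minute translation pricing in concrete terms, here is a back-of-the-envelope sketch in Python. Only the $0.034 per minute figure for GPT‑Realtime‑Translate comes from the pricing cited above. The helper name, the example call lengths, and the assumption that cost scales linearly with audio minutes (no rounding, minimums, or tiers) are illustrative, so actual bills may differ.

```python
from decimal import Decimal

# From the article: GPT-Realtime-Translate is listed at $0.034 per minute of audio.
TRANSLATE_USD_PER_MINUTE = Decimal("0.034")


def estimate_translate_cost(minutes: float, sessions: int = 1) -> Decimal:
    """Rough cost estimate in USD, assuming billing is linear in audio minutes.

    Hypothetical helper for illustration only: real billing may round,
    meter, or tier usage differently.
    """
    if minutes < 0 or sessions < 0:
        raise ValueError("minutes and sessions must be non-negative")
    # Convert via str() so float inputs like 1.5 do not carry binary noise.
    return TRANSLATE_USD_PER_MINUTE * Decimal(str(minutes)) * sessions


if __name__ == "__main__":
    # One 30-minute translated call, then 1,000 such calls.
    print(estimate_translate_cost(30))
    print(estimate_translate_cost(30, sessions=1000))
```

Under those assumptions, a 30-minute translated call comes to about $1.02, and a thousand of them to about $1,020. GPT‑Realtime‑2 cannot be estimated the same way, because it is billed by tokens rather than minutes.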