OpenAI launches three realtime voice models
- OpenAI added three API voice models on May 7 (gpt-realtime-2, gpt-realtime-translate, and gpt-realtime-whisper), aimed at live speech agents, translation, and transcription.
- The sharpest detail is scope: translation takes 70+ input languages into 13 outputs, while gpt-realtime-2 adds configurable reasoning that trades latency for smarter replies.
- This pushes voice from demo to product surface, but builders still have to manage interruptions, tool use, latency, and token cost.

Voice AI is shifting from “talk to a bot” demos to actual product plumbing. That’s the real story here. In the API update it posted on May 7, OpenAI added three new realtime audio models: one for voice agents, one for live translation, and one for streaming transcription. The pitch is simple: make speech something apps can actually build around, not just bolt on at the end.

### What actually launched?

OpenAI split the job into three models. gpt-realtime-2 is the main voice-agent model: speech in, speech out, with stronger reasoning and tool use. gpt-realtime-translate is a dedicated interpreter model for live speech translation. gpt-realtime-whisper is the low-latency transcription model for turning live audio into streaming text. OpenAI is positioning them as separate primitives rather than one model that does everything badly. (openai.com)

### Why split them up?

Because realtime voice has conflicting goals. A voice agent needs to think, decide, maybe call a tool, and then answer naturally. A translator needs to keep pace with a speaker and stay faithful to meaning. A transcription model just needs fast, stable text. OpenAI’s docs make that separation explicit: if you want an assistant, use gpt-realtime-2; if you want an interpreter, use the translation stack instead. (openai.com)

### What’s new about gpt-realtime-2?

The big change is reasoning. OpenAI describes gpt-realtime-2 as its first voice model with GPT-5-class reasoning, and the docs say developers can tune “reasoning effort.” That matters because voice systems usually cheat toward speed: they blurt something out fast, but fall apart on multi-step requests or tool calls. Here, developers can push the model to think harder; the catch is higher latency and more output-token use. (developers.openai.com)

### How broad is the translation model?

Broader than the name suggests.
OpenAI says gpt-realtime-translate can take speech from 70+ input languages and render it into 13 output languages while keeping pace with the speaker. The cookbook examples point at two obvious use cases: one-to-many translation for livestreams and keynotes, and conversational translation for things like call centers or multilingual video chat. That makes it less like subtitles and more like a live interpreter service you can wire into an app. (openai.com)

### Where does Whisper fit now?

It becomes the specialist. OpenAI’s realtime guide says gpt-realtime-whisper is for streaming transcription with controllable latency: lower delay for earlier partial text, higher delay for better transcript quality. That sounds small, but it is the kind of tradeoff product teams actually need when they are building captions, meeting notes, moderation pipelines, or creator tools. (openai.com)

### Why does this matter for builders?

Because the hard part of voice apps is no longer just speech recognition. It’s orchestration. Realtime sessions stay open, keep state, stream events, and can update behavior mid-conversation. The model can also use tools more precisely than earlier realtime systems. That gets you closer to customer support agents, translators, tutors, and phone workflows that feel coherent instead of brittle. (developers.openai.com)

### So what’s still hard?

All the ugly product decisions. Interruptions. Turn-taking. Permissions. Whether the assistant should act or just suggest. Whether a translation should preserve tone or prioritize speed. And cost: gpt-realtime-2 pricing still rises with token usage, and more reasoning can add both latency and spend. Voice is getting better as a model capability, but shipping it still means designing the control layer around the model. (developers.openai.com)

### Bottom line?

OpenAI didn’t just ship “better voice.” It carved realtime audio into three clearer jobs and made each one more usable.
That is what turns voice from a flashy interface into something developers can treat like infrastructure. (openai.com) (developers.openai.com)
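
As a rough illustration of the three-way split, here is a minimal Python sketch that routes a task to one of the models named in the announcement. The model names come from the article; the `VoiceTask` enum, the `MODEL_FOR_TASK` table, and the `pick_model` helper are hypothetical names introduced for illustration, not part of any OpenAI SDK, and no API call is made.

```python
from enum import Enum


class VoiceTask(Enum):
    """Hypothetical task categories mirroring the three jobs in the article."""
    AGENT = "agent"            # speech in, speech out, with reasoning and tools
    TRANSLATE = "translate"    # live speech interpretation
    TRANSCRIBE = "transcribe"  # streaming speech-to-text


# Model names as reported in the article; the mapping itself is an
# illustrative assumption about how an app might route work.
MODEL_FOR_TASK = {
    VoiceTask.AGENT: "gpt-realtime-2",
    VoiceTask.TRANSLATE: "gpt-realtime-translate",
    VoiceTask.TRANSCRIBE: "gpt-realtime-whisper",
}


def pick_model(task: VoiceTask) -> str:
    """Return the model name an app might request for the given task."""
    return MODEL_FOR_TASK[task]


if __name__ == "__main__":
    for task in VoiceTask:
        print(f"{task.value}: {pick_model(task)}")
```

Keeping the routing in one table means the rest of an app can ask for a job (assistant, interpreter, transcriber) rather than hard-coding a model name at each call site.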