OpenAI ships GPT‑Realtime‑2 voice

- OpenAI launched three realtime voice API models on May 7: GPT‑Realtime‑2, GPT‑Realtime‑Translate, and GPT‑Realtime‑Whisper, aimed at live speech apps.
- The standout detail is scope: translation now takes 70+ spoken input languages into 13 output languages, while Realtime‑2 adds configurable reasoning.
- This moves OpenAI’s voice stack from basic speech chat toward full live agents, interpreters, and transcription systems developers can ship now.

Voice AI has been good at one thing at a time. You could get a chatbot that talks, or a transcription system, or a translation layer. But stitching those together into something that feels like a live human conversation has been the hard part. OpenAI’s May 7 release is basically an attempt to collapse those pieces into one realtime stack: a new reasoning-heavy voice model, a dedicated live translation model, and a streaming transcription model. (openai.com)

### What actually shipped?

OpenAI added three models to its API: `gpt-realtime-2` for speech-to-speech agents, `gpt-realtime-translate` for live interpretation, and `gpt-realtime-whisper` for streaming speech-to-text. They sit on top of the company’s Realtime API, which already handled low-latency audio sessions over WebRTC, WebSocket, and SIP. The change is not just “be […] criber. (openai.com)

### Why is GPT‑Realtime‑2 the big one?

Because this is the model that tries to think while talking. OpenAI describes it as a new realtime voice model with configurable reasoning effort, which means developers can trade some speed for better handling of harder requests and more context-heavy conversations. That is the real shift here: voice systems have usually been fast […] higher-end reasoning. (developers.openai.com)

### What does “configurable reasoning” mean in practice?

It means developers can tune how much thinking the model does before answering. OpenAI’s docs explicitly warn that higher reasoning effort can raise latency and output token usage, and they recommend starting low for most production voice agents. So this is not magic. It is a knob. If you are building a customer-support bot, […]ing agent, you may accept a little delay for better answers. (developers.openai.com)

### Why split out translation?

Because translation has different rules from conversation. A voice assistant is supposed to respond. An interpreter is supposed to preserve what the speaker said, in another language, while the speaker is still talking. OpenAI’s translation model is built for that narrower job: streaming source audio in, then returning translated audio plus transcri […] ts, lessons, and video rooms. (developers.openai.com)

### How broad is the language support?

The headline number is 70+ input languages and 13 output languages for realtime translation. That matters because live translation systems usually break not on the demo language pair, but on the long tail: accents, code-switching, and less common languages. OpenAI is clearly trying to make the product feel less like a toy travel demo and more like infrastructure for actual cross-language communication. (community.openai.com)

### And what about transcription?

OpenAI also shipped `gpt-realtime-whisper`, which is meant for live speech-to-text sessions rather than assistant replies. The company’s docs make an important distinction here: transcription sessions are separate from voice-agent sessions, and the model is positioned as an option for live transcription rather than a blanket replacement […] (developers.openai.com)

### So what changed versus before?

Before this, OpenAI already had realtime voice and a general-availability realtime model. The new release adds specialization and a more explicit product map. Instead of one generic realtime model doing everything, developers now get a reasoning-first voice agent, a dedicated interpreter, and a dedicated transcriber. That usually means th […] n jobs.” (openai.com)

### Bottom line

The important part is not that OpenAI made voice sound nicer. It is that the company is treating live speech as its own application layer now, with separate models for thinking, translating, and transcribing. If that works in production, voice apps stop being chatbots with a microphone and start feeling more like real-time software for calls, meetings, support, and language access. (openai.com)
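
To make that product map concrete, here is a minimal Python sketch of how an app might route a job to one of the three models and treat reasoning effort as a knob that starts low. Only the three model IDs come from the announcement. `SessionPlan`, `plan_session`, the job labels, and the `"low"`/`"high"` effort values are hypothetical names chosen for illustration, not OpenAI's API, and the sketch makes no network call.

```python
from dataclasses import dataclass
from typing import Optional

# Model IDs as named in the article; the job labels on the left are hypothetical.
MODEL_BY_JOB = {
    "agent": "gpt-realtime-2",              # speech-to-speech agents
    "translate": "gpt-realtime-translate",  # live interpretation
    "transcribe": "gpt-realtime-whisper",   # streaming speech-to-text
}


@dataclass(frozen=True)
class SessionPlan:
    """Hypothetical record of the choices an app makes before opening a session."""
    model: str
    reasoning_effort: Optional[str]  # None where the article mentions no reasoning knob


def plan_session(job: str, needs_deep_reasoning: bool = False) -> SessionPlan:
    """Pick a model for the job and, for agents only, an assumed effort label."""
    if job not in MODEL_BY_JOB:
        raise ValueError(f"unknown job: {job!r}")
    effort = None
    if job == "agent":
        # Assumed labels: start low, raise only when answers matter more than delay.
        effort = "high" if needs_deep_reasoning else "low"
    return SessionPlan(model=MODEL_BY_JOB[job], reasoning_effort=effort)


print(plan_session("agent"))
print(plan_session("translate"))
```

Defaulting to the low setting mirrors the advice attributed to OpenAI's docs above: higher effort is something to opt into per use case, since it can cost latency and output tokens.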
