OpenAI splits real-time voice models

- OpenAI launched three separate API voice models on May 7 (GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper) instead of one do-everything realtime stack.
- The split maps cleanly to three jobs: speech agent, live translation, and streaming transcription, with new endpoints including `/v1/realtime/translations` and transcription sessions.
- It matters because voice apps used to juggle multiple models and brittle glue code; OpenAI is now productizing that orchestration layer.

Voice AI is getting less monolithic. That’s the real news here. On May 7, OpenAI split its realtime audio stack into three distinct API models: one for live speech agents, one for live translation, and one for streaming transcription. The point is simple: stop making developers stitch together a half-dozen moving parts just to build something that can listen and talk back.

### What actually changed?

OpenAI introduced GPT-Realtime-2 for speech-to-speech agents, GPT-Realtime-Translate for live spoken translation, and GPT-Realtime-Whisper for realtime speech-to-text. It also added purpose-built API surfaces for those jobs, including `/v1/realtime`, `/v1/realtime/translations`, and transcription sessions. That’s a cleaner product map than the older “realtime plus extra components” approach. (openai.com)

### Why split them up?

Because “voice app” sounds like one thing, but it usually isn’t. A customer-support bot needs turn-taking, tool use, and reasoning. A live interpreter needs low-latency translation while someone is still speaking. A captioning app just needs fast, steady text. Cramming all three into one general realtime model makes orchestration harder and usually pushes developers to bolt on extra speech recognition or translation layers anyway. OpenAI is basically admitting those are separate workloads. (openai.com)

### What is Realtime-2 for?

This is the agent model. OpenAI says GPT-Realtime-2 is built for speech-to-speech interactions and adds configurable reasoning, better context handling, and more natural conversation flow. The important bit is “configurable reasoning”: developers can tune how much thinking the model does instead of treating voice as a thin wrapper around a text model. That matters for harder calls, support flows, and assistant-style tasks where the model has to decide, not just transcribe. (openai.com)

### What is Realtime-Translate for?

It’s a dedicated live interpreter.
You stream source audio into a translation session and get translated audio plus transcript deltas while the speaker is still talking. OpenAI positions it for multilingual calls, meetings, broadcasts, lessons, and video rooms. The key distinction is that this is not an assistant that happens to know another language; it’s a model path optimized for “human says X, system says X in another language right now.” (openai.com)

### And Realtime-Whisper?

That one is for streaming transcription. OpenAI says GPT-Realtime-Whisper lets developers trade latency against transcript quality, which is exactly the knob captioning and call-analysis systems care about. If you want partial text sooner, you can bias for lower delay. If you want cleaner output, you can wait a bit longer. That sounds small, but it’s the kind of control production teams actually need. (developers.openai.com)

### Why does this matter for developers?

Because the old pain was never just model quality. It was architecture. A realtime voice product often meant one system for speech recognition, another for reasoning, another for translation, plus custom logic to keep them in sync. That stack works, but it’s expensive and brittle, like building a live radio booth out of separate mixers, delays, and adapters. OpenAI is trying to collapse more of that into first-party primitives. (developers.openai.com)

### Is this replacing the older realtime model?

Not exactly. OpenAI’s docs still describe `gpt-realtime` as its advanced speech-to-speech model in prior guidance, but the May 7 changelog adds these newer specialized models on top. So this looks less like a hard replacement and more like a product-line split: one general realtime path, plus more targeted voice models for translation and transcription, and a newer agent-focused tier in Realtime-2. (openai.com)

### Bottom line?

OpenAI didn’t just ship better voice models. It carved voice into three concrete jobs and gave each one its own lane.
That should make voice agents easier to build, easier to reason about, and a lot less dependent on custom orchestration glue. (openai.com 1) (openai.com 2)
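
For developers, the three-lane split can be read as a small routing table. The Python sketch below is only an illustration of that idea, not OpenAI's SDK: `Workload`, `Route`, `ROUTES`, and `pick_route` are hypothetical names, the model strings are the product names as written in this piece (not confirmed API model IDs), pairing `/v1/realtime` with the agent model is an assumption, and the transcription entry has no path because only "transcription sessions" are mentioned.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Workload(Enum):
    AGENT = "agent"                  # listen, reason, talk back
    TRANSLATION = "translation"      # live spoken translation
    TRANSCRIPTION = "transcription"  # streaming speech-to-text


@dataclass(frozen=True)
class Route:
    model: str               # product name as written in the article
    endpoint: Optional[str]  # None where the article gives no path


# Hypothetical routing table built only from names in the article.
ROUTES = {
    # Assumption: /v1/realtime is the surface used by the agent model.
    Workload.AGENT: Route("GPT-Realtime-2", "/v1/realtime"),
    Workload.TRANSLATION: Route(
        "GPT-Realtime-Translate", "/v1/realtime/translations"
    ),
    # The article mentions "transcription sessions" without a path.
    Workload.TRANSCRIPTION: Route("GPT-Realtime-Whisper", None),
}


def pick_route(needs_reply: bool, target_language: Optional[str] = None) -> Route:
    """Pick a lane: translate if a target language is set,
    otherwise act as an agent if a spoken reply is needed,
    otherwise just transcribe."""
    if target_language is not None:
        return ROUTES[Workload.TRANSLATION]
    if needs_reply:
        return ROUTES[Workload.AGENT]
    return ROUTES[Workload.TRANSCRIPTION]


if __name__ == "__main__":
    print(pick_route(needs_reply=True))
    print(pick_route(needs_reply=False, target_language="es"))
    print(pick_route(needs_reply=False))
```

The point of the sketch is that the decision of which model to call collapses to a lookup once each job has its own lane, which is the orchestration work that previously lived in custom glue code.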
