OpenAI ships 3 realtime voice models
- OpenAI released three realtime voice models for developers to power live speech apps with low‑latency interaction, translation and streaming transcription.
- The models are named GPT‑Realtime‑2, GPT‑Realtime‑Translate and GPT‑Realtime‑Whisper and are being pitched as infrastructure for voice apps rather than demos.
- Realtime voice forces engineers to solve streaming protocols, interruption handling, latency budgets and local vs remote inference for practical voice apps. (testingcatalog.com) (techradar.com)

Voice AI is shifting from “talk to a demo” to “run a real product.” That’s the point of OpenAI’s new realtime audio release. Instead of one general voice model trying to do everything, OpenAI split the stack into three jobs (reasoning in conversation, live translation, and streaming transcription) and put them into the API for developers building apps that have to keep up with actual human speech.

### What actually shipped?

OpenAI introduced GPT‑Realtime‑2, GPT‑Realtime‑Translate, and GPT‑Realtime‑Whisper in its API on May 7, 2026. GPT‑Realtime‑2 is the conversational one: the model meant to listen, reason, speak back, and use tools while the conversation is still happening. GPT‑Realtime‑Translate handles live speech translation. GPT‑Realtime‑Whisper does low-latency speech-to-text for captions, notes, and other streaming transcript use cases.

### Why split this into three models?

Because “voice” is really three different engineering problems wearing one label. A voice agent has to understand interruptions, decide what to do next, maybe call a tool, and answer naturally. Translation has a different job: keep pace with a speaker while preserving meaning across languages. Transcription is different again: get text on screen fast, with controllable delay and decent accuracy. OpenAI’s docs now point developers to a different model depending on which of those jobs they actually need.

### What’s special about GPT‑Realtime‑2?

OpenAI is pitching GPT‑Realtime‑2 as its first voice model with “GPT‑5‑class reasoning.” The important part is not the branding; it’s that the model can spend more or less effort thinking, with a tradeoff between intelligence and latency. That matters for voice apps because every extra beat feels awkward. A support bot helping reset a password can’t pause like it’s writing an essay. But a medical intake assistant or travel planner might benefit from a little more reasoning if the answer is better.
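
To make that tradeoff concrete, here is a minimal Python sketch of how an app might pick an effort level from a latency budget. Everything in it is an assumption for illustration: the level names, the latency estimates, and the per-use-case budgets are invented, and none of them are OpenAI parameter names or measured numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EffortLevel:
    name: str            # hypothetical label, not an official API value
    est_latency_ms: int  # made-up estimate of added thinking time


# Hypothetical levels, ordered from fastest to most thorough.
LEVELS = [
    EffortLevel("low", 300),
    EffortLevel("medium", 700),
    EffortLevel("high", 1500),
]

# Illustrative budgets: how long each product can stay silent before it feels awkward.
USE_CASE_BUDGET_MS = {
    "password_reset": 400,
    "medical_intake": 1600,
}


def pick_effort(latency_budget_ms: int) -> EffortLevel:
    """Return the most thorough level that fits the budget.

    If nothing fits, fall back to the fastest level instead of failing,
    since a quick answer beats silence in a live conversation.
    """
    fitting = [lvl for lvl in LEVELS if lvl.est_latency_ms <= latency_budget_ms]
    return fitting[-1] if fitting else LEVELS[0]


if __name__ == "__main__":
    for use_case, budget in USE_CASE_BUDGET_MS.items():
        print(use_case, "->", pick_effort(budget).name)
```

With these invented numbers the password-reset bot lands on the fastest level and the intake assistant on the most thorough one. In a real app the estimates would come from measuring your own sessions.
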
OpenAI also kept pricing for text tokens in line with the earlier 1.5 model, while exposing that reasoning dial to developers.

### Why does translation need its own model?

Live translation is the hardest version of the trick. The system has to listen to speech in one language, decide what the speaker means before the sentence is fully over, and start speaking in another language without falling badly behind. OpenAI says GPT‑Realtime‑Translate takes speech from 70+ input languages and can output 13 languages in realtime. That makes it less like subtitle software and more like infrastructure for bilingual meetings, travel assistants, and customer support handoffs.

### What does “streaming Whisper” change?

The old mental model for transcription was upload audio, wait, get text. GPT‑Realtime‑Whisper is for the opposite case: text arrives while the person is still talking. OpenAI says developers can tune latency, which is basically choosing where to sit on the speed-versus-accuracy slider. Lower delay gives faster partial text. Higher delay can improve quality. That sounds small, but it’s the difference between captions that feel instant and captions that feel like they’re chasing the speaker.

### Why is realtime voice still hard?

Because the model is only one piece. The app still has to keep a session open over WebRTC, WebSocket, or SIP, stream audio both ways, handle barge-in when the user interrupts, decide when the model should speak, and keep latency low enough that the exchange feels human. OpenAI’s own docs now frame these models as pieces of a production system, not just a flashy interface.

### Who is this for?

Developers first. OpenAI is clearly aiming at companies building customer support agents, interpreters, meeting tools, assistants, and other voice products that need to act in real time.
Reports around the launch also point to companies like Zillow and Priceline testing these tools, which fits the pattern: voice is becoming application infrastructure, not just a chatbot party trick.

### Bottom line?

The real news is not just that OpenAI shipped three new voice models. It’s that realtime voice is getting unbundled into specialized components. Basically, the industry is moving from “can AI talk?” to “can AI listen, think, translate, transcribe, and respond fast enough to be useful?” OpenAI wants to be the default plumbing for that shift.
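
For readers who want a feel for the barge-in problem mentioned under “Why is realtime voice still hard?”, here is a minimal Python sketch of the turn-taking logic an app has to own. It is a toy state machine under stated assumptions: the class, the event names, and the counter are hypothetical, and it does not use or mirror any OpenAI SDK call. A real app would wire these handlers to events from its WebRTC, WebSocket, or SIP session.

```python
from enum import Enum, auto


class State(Enum):
    LISTENING = auto()  # user has the floor
    THINKING = auto()   # user stopped, model response not yet audible
    SPEAKING = auto()   # model audio is playing


class TurnManager:
    """Hypothetical turn-taking tracker for a single voice session."""

    def __init__(self) -> None:
        self.state = State.LISTENING
        self.cancelled_responses = 0

    def on_user_speech_start(self) -> None:
        # Barge-in: the user talks over a pending or playing response.
        # A real app would also stop playback and cancel the response here.
        if self.state in (State.THINKING, State.SPEAKING):
            self.cancelled_responses += 1
        self.state = State.LISTENING

    def on_user_speech_end(self) -> None:
        if self.state is State.LISTENING:
            self.state = State.THINKING

    def on_model_audio_start(self) -> None:
        if self.state is State.THINKING:
            self.state = State.SPEAKING

    def on_model_audio_end(self) -> None:
        if self.state is State.SPEAKING:
            self.state = State.LISTENING


if __name__ == "__main__":
    tm = TurnManager()
    tm.on_user_speech_end()     # user finishes a question
    tm.on_model_audio_start()   # model starts answering
    tm.on_user_speech_start()   # user interrupts mid-answer
    print(tm.state.name, tm.cancelled_responses)
```

The point of the sketch is that the interrupt path is the one that matters: the user's speech always wins the floor, and anything the model was about to say gets dropped rather than queued.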