OpenAI expands real-time voice models
- OpenAI launched three new API voice models on May 7: GPT‑Realtime‑2, GPT‑Realtime‑Translate, and GPT‑Realtime‑Whisper for live reasoning, translation, and transcription. (openai.com)
- The headline spec is GPT‑Realtime‑2’s 128,000-token context window, plus translation from 70+ input languages into 13 output languages in live sessions. (openai.com)
- This pushes voice AI past simple turn-taking and toward agents that can remember, use tools, and act mid-conversation. (openai.com)

Voice AI is moving out of the demo phase. The hard part was never just making a bot sound natural; it was getting one to keep up, remember what was said, and do useful work while s[...] (openai.com) audio models to its API, with one for voice reasoning, one for live translation, and one for streaming transcription. (openai.com)

[...] introduced GPT‑Realtime‑2, GPT‑Realtime‑Translate, and GPT‑Realtime‑Whisper. They are split by job on purpose: one handles [...] (openai.com) speech into text as it arrives. That sounds obvious, but it matters because realtime voice products break when one model is asked to do everything badly instead of one thing well. (openai.com)

### Why is GPT‑Realtime‑2 the main story?

Because this is the model aimed at actual voice agents, not just dictation or canned rep[...] (openai.com) instruction following, more reliable tool use, and configurable reasoning effort so developers can trade speed against depth. Basically, the model is supposed to think while talking without turning every conversation into a laggy, pause-filled mess. (openai.com)

### Why does the 128k context window matter?

Memory is the quiet bottleneck in voice sy[...] (openai.com) which means a voice app can keep far more conversation history in play before it starts forgetting instructions, names, constraints, or earlier tool results. In practice, that makes longer customer-support calls, tutoring sessions, and multi-step assistants much more plausible. (developers.openai.com)

### What’s new on translation?

OpenAI split translation into its own dedicated realtime session and model: G[...] (openai.com) languages, and the system streams translated audio and transcript updates continuously instead of waiting for neat sentence boundaries. That is a big deal for meetings, events, travel, and support, where “close enough but late” is often worse than imperfect but immediate. (openai.com)

### And the transcription model?

GPT‑Realtime‑Whisper is the streaming speech-to-text piece.
It is b[...] (developers.openai.com) [...]n update while a person is still speaking. The key difference is that a transcription session does not try to answer back; it just emits text, which keeps the system simpler and better suited to accessibility and analytics use cases. (openai.com)

### Why separate endpoints?

Because these are really three different products hiding under the label “voice.” OpenAI’s docs now distinguish [...] (openai.com) transcription sessions on `/v1/realtime/transcription_sessions`. That separation should make deployment cleaner, and it signals that realtime voice is becoming infrastructure, not just a flashy model feature. (developers.openai.com)

### What changed from the earlier realtime push?

The earlier Realtime API was about low-latency conversation. This update adds [...] (openai.com) shifted from “talk to a model naturally” to “build a voice system that can listen, reason, translate, transcribe, and use tools while the conversation is unfolding.” That is a much more ambitious product claim. (openai.com)

### Bottom line

The interesting part is not that OpenAI made voice sound better. It is that OpenAI is trying to make voice useful enough to become a real software interface [...] (developers.openai.com) best voice apps will feel less like talking to a chatbot and more like delegating work in real time. (openai.com)