OpenAI launches GPT‑Realtime‑2 API to power GPT‑5‑class real‑time voice agents
- OpenAI launched GPT‑Realtime‑2 on May 7, alongside Realtime Translate and Realtime Whisper, expanding its API for low-latency speech agents that reason while listening.
- The new model keeps GPT‑Realtime pricing at $4 per 1M text input tokens, but adds configurable reasoning effort that can trade latency for smarter replies.
- This pushes voice apps beyond chatbots into meetings, support, and translation, where interruption handling and transcript drift become product-defining problems.

Voice AI is moving from “talks back” to “actually thinks while you talk.” That’s the real news here. On May 7, OpenAI rolled out GPT‑Realtime‑2, plus two sibling models for live translation and streaming transcription, inside its API stack. The point is not just nicer synthetic speech. It’s to let developers build voice agents that can reason, translate, and respond fast enough to feel conversational rather than queued. (openai.com)

### What actually launched?

OpenAI shipped three audio models: GPT‑Realtime‑2 for speech-to-speech agents, GPT‑Realtime‑Translate for live speech translation, and GPT‑Realtime‑Whisper for streaming speech-to-text. They sit on top of the existing Realtime API, which already supports low-latency connections over WebRTC, WebSocket, and SIP. So this is less a brand-new platform than a more capable model layer dropped into a system developers already use for voice apps. (openai.com)

### Why is GPT‑Realtime‑2 the interesting part?

Because OpenAI is positioning it as the first realtime voice model in its API with “GPT‑5‑class” reasoning. In plain English, that means the model is supposed to do better on harder requests (longer context, trickier instructions, more back-and-forth) without forcing developers to leave the low-latency voice stack. The docs also show configurable reasoning effort, which affects latency and output token usage. (openai.com)

### What changed versus the older realtime model?

The older general-availability model, GPT‑Realtime, was already built for production voice agents and shipped in 2025 with improvements in audio quality, instruction following, and function calling. GPT‑Realtime‑2 keeps the same headline text-token price, $4 per 1M input tokens and $24 per 1M output tokens, but adds the explicit reasoning control that makes the trade-off plain for developers: pick your speed, then decide how much thinking you can afford. (openai.com)

### Why do translation and transcription matter so much?
Because a lot of real voice products are not just “assistant talks to person.” They’re meetings, call centers, interpreters, and note-taking systems. Realtime Translate is for streaming speech translation while someone is still talking. Realtime Whisper is for streaming transcripts with controllable latency: lower delay for faster partial text, higher delay for better quality. That combination makes uses such as customer support plausible, not just flashy demos. (openai.com, developers.openai.com)

### What’s the catch with realtime voice?

Latency is only one problem. Attribution is another. OpenAI’s own Realtime docs note that transcription runs as a separate ASR process and can diverge from the model’s interpretation of the audio. That sounds subtle, but it matters a lot in enterprise settings. If the spoken transcript, the translated text, and the agent’s spoken reply are not perfectly aligned, users start asking which layer to trust. (developers.openai.com, platform.openai.com)

### Why do interruptions matter?

Because natural conversation is messy. People cut in. They change their mind mid-sentence. The Realtime API already has explicit mechanisms for handling interrupted audio and truncating speech that was generated but not yet played. That’s a big clue about where the hard product work lives. The hard part is not just generating a fluent answer. It’s knowing when to stop talking, when to revise, and how not to bulldoze the human. (openai.com)

### So who is this really for?

Developers building customer support agents, meeting tools, interpreters, phone systems, and voice interfaces that need both speed and judgment. OpenAI also has a voice-agents guide in its developer stack now, which suggests the company sees this as a broader agent workflow, not a standalone speech feature. Voice is becoming another front end for tool-using AI systems. (developers.openai.com)

OpenAI did not just release a better talking model. It pushed reasoning deeper into the realtime layer.
That opens the door to more capable voice agents, but it also makes product choices around delay, interruption, and source-of-truth much more important. The model got smarter. The UI problem got harder.
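
To put rough numbers on "how much thinking you can afford," the text-token prices quoted above ($4 per 1M input tokens, $24 per 1M output tokens) can be turned into a back-of-the-envelope estimator. This is only a sketch: the function name and the token counts are made up for illustration, it assumes that higher reasoning effort shows up as more output tokens, and it ignores audio tokens, which are billed separately.

```python
from decimal import Decimal

# Text-token prices quoted in the article (USD per 1M tokens).
INPUT_PRICE_PER_M = Decimal("4")
OUTPUT_PRICE_PER_M = Decimal("24")
MILLION = Decimal(1_000_000)


def text_token_cost(input_tokens: int, output_tokens: int) -> Decimal:
    """Return the USD cost of text tokens only (audio tokens are not modeled)."""
    return (
        Decimal(input_tokens) * INPUT_PRICE_PER_M
        + Decimal(output_tokens) * OUTPUT_PRICE_PER_M
    ) / MILLION


# Hypothetical session: the same 20,000-token prompt, with two invented output
# sizes standing in for a lower and a higher reasoning-effort setting.
low_effort = text_token_cost(input_tokens=20_000, output_tokens=2_000)
high_effort = text_token_cost(input_tokens=20_000, output_tokens=6_000)

print(f"lower effort:  ${low_effort}")   # prints "lower effort:  $0.128"
print(f"higher effort: ${high_effort}")  # prints "higher effort: $0.224"
```

Because output tokens cost six times as much as input tokens at these prices, any setting that makes replies longer moves the bill faster than a longer prompt does.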