OpenAI ships GPT‑Realtime‑2 API
- OpenAI launched GPT‑Realtime‑2 on May 7, 2026, adding a new speech‑to‑speech API model for voice agents that reason while audio streams in. (openai.com)
- The key change is configurable reasoning effort inside realtime voice sessions, plus new sibling models for live translation and streaming transcription. (developers.openai.com)
- This pushes voice apps beyond fast chatbots into tool-using agents, but with sharper latency, cost, and session-design tradeoffs. (developers.openai.com)

Voice AI is shifting from “talk back quickly” to “actually think while talking.” That’s the real point of OpenAI’s GPT‑Realtime‑2 launch on May 7. OpenAI didn’t just ship another audio model; it added a new speech‑to‑speech model for the Realtime API that can reason during live conversations, plus separate models for translation and transcription. (openai.com)

For developers, that changes what a voice agent can be. It also changes what a production system has to manage. (developers.openai.com)

### What actually shipped?

OpenAI released three new audio models in the API: GPT‑Realtime‑2 for realtime voice agents, GPT‑Realtime‑Translate for live speech translation, and GPT‑Realtime‑Whisper for low-latency transcription. (developers.openai.com)

The release also expanded the relevant endpoints (`v1/realtime`, `v1/realtime/translations`, and `v1/realtime/transcription_sessions`), so this was a platform update, not just a model-name refresh.

### What is GPT‑Realtime‑2 supposed to do?

Basically, it is the “harder conversations” model. OpenAI positions GPT‑Realtime‑2 for voice agents that need stronger reasoning, better tool selection, exact entity handling, and longer session state. (openai.com)

That means the target use case is not a simple voice bot reading canned replies. It is a live agent that hears speech, keeps context, decides whether to call tools, and answers out loud without breaking the flow.

### Why is this different from older voice APIs?

Older stacks often stitched together speech recognition, a text model, and text-to-speech. That works, but it adds delay and makes interruptions awkward. (openai.com)

The Realtime API is built around persistent low-latency connections over WebRTC, WebSocket, or SIP, with native audio input and output. GPT‑Realtime‑2 sits inside that setup and adds reasoning as part of the live speech workflow, which is the new bit.

### What is the catch with “reasoning” in realtime?

Reasoning is not free.
OpenAI’s model page says GPT‑Realtime‑2 supports configurable reasoning effort, and higher effort can increase latency and output token usage. (developers.openai.com)

That is the core tradeoff: smarter live responses can mean slower replies and higher bills. For a voice product, even small delays feel bigger than they do in text chat, so teams now have to tune intelligence against conversational smoothness.

### What does this mean for platform teams?

It means voice is becoming an orchestration problem. A production app now has to manage streaming audio, session state, tool permissions, confirmation boundaries for actions, and event handling across a long-lived connection. (platform.openai.com)

OpenAI’s own guidance tells developers to start with low reasoning effort, test default preambles, and define clear rules before any write action. That reads less like chatbot prompting and more like realtime systems engineering.

### Why launch translation and transcription too?

Because a voice stack is rarely just one model. Realtime agents often need live captions, multilingual handoff, or side-channel transcripts for logging and QA. (developers.openai.com)

OpenAI is bundling those adjacent jobs into sibling models (one for streaming translation, one for streaming speech-to-text) so developers can build a fuller voice pipeline inside the same product family.

### So what changed versus last year?

Last year’s story was getting realtime voice into production at all. OpenAI’s earlier push centered on `gpt-realtime`, GA for the Realtime API, and features like MCP support, image input, and SIP calling. (developers.openai.com)

This week’s shift is more ambitious: not just realtime speech, but realtime speech with configurable reasoning. In plain English, the model is being asked to think more like a general agent while it talks.

### Bottom line

GPT‑Realtime‑2 matters because it moves voice AI closer to a usable operator, not just a fast narrator.
But the upgrade comes with a very practical bill: more tuning, more observability, and more care around latency. (openai.com)

Voice apps just got more capable. They also got more like infrastructure. (openai.com)
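
### What might that look like in code?

As a rough illustration of the guidance discussed earlier (start with low reasoning effort, and set clear rules before any write action), here is a minimal Python sketch. It is not taken from OpenAI's documentation: the payload shape, the `reasoning.effort` field, the `gpt-realtime-2` model string, the effort levels other than "low", and the tool names are all assumptions made for illustration. The sketch only builds a session-update style message and applies an app-side confirmation gate; it opens no connection.

```python
import json
from dataclasses import dataclass

# Assumed effort levels. The article only confirms that effort is configurable
# and that "low" is the recommended starting point.
EFFORT_LEVELS = ("low", "medium", "high")

# Hypothetical tool names that change state and therefore need confirmation.
WRITE_TOOLS = {"cancel_order", "issue_refund"}


def build_session_update(effort: str = "low") -> str:
    """Serialize a session-update style event. Field names are assumptions."""
    if effort not in EFFORT_LEVELS:
        raise ValueError(f"unknown reasoning effort: {effort!r}")
    event = {
        "type": "session.update",
        "session": {
            "model": "gpt-realtime-2",        # assumed model identifier
            "reasoning": {"effort": effort},  # assumed field name and shape
        },
    }
    return json.dumps(event)


@dataclass
class ToolCall:
    name: str
    arguments: dict


def decide(call: ToolCall, user_confirmed: bool) -> str:
    """Run read-only tools directly; gate write tools on explicit confirmation."""
    if call.name in WRITE_TOOLS and not user_confirmed:
        return "ask_confirmation"
    return "run"
```

Keeping the confirmation gate in application code, rather than relying on the prompt alone, means a write action cannot run just because the model decided to call it. Raising the effort level then becomes a single setting to change while measuring how much latency each step adds.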