OpenAI launches GPT‑Realtime‑2 API

- OpenAI rolled out three live audio API models on May 7 (GPT‑Realtime‑2, GPT‑Realtime‑Translate, and GPT‑Realtime‑Whisper), aimed at production voice apps.
- The standout detail is translation scope and pricing: 70+ input languages, 13 output languages, and GPT‑Realtime‑Translate billed at $0.034 per audio minute.
- This pushes OpenAI beyond chatbot voice mode into infrastructure for call centers, interpreters, and transcription tools built around low-latency streaming.

OpenAI just turned its realtime audio stack into something much more specific. Not just “talk to an assistant,” but three separate tools for three separate jobs: a voice agent, a live interpreter, and a streaming transcription engine. That matters because those use cases break in different ways, and one generic voice model usually isn’t great at all of them. On May 7, OpenAI split the lineup and launched GPT‑Realtime‑2, GPT‑Realtime‑Translate, and GPT‑Realtime‑Whisper for developers building production apps.

### What actually launched?

The release is three new API models. GPT‑Realtime‑2 is the general voice assistant model, the one for back-and-forth conversations, tool use, and harder spoken requests. GPT‑Realtime‑Translate is a dedicated live translation model. GPT‑Realtime‑Whisper is for streaming speech-to-text. OpenAI pitched them as a new generation of voice models rather than a small refresh of the old realtime setup. (openai.com)

### Why split them up?

Because “voice” is really three products hiding under one label. A voice agent has to reason, remember context, and maybe call tools. A translator has to stay close to the speaker and not start acting like a chatbot. A transcription model has to emit text fast, then keep cleaning it up as more audio arrives. OpenAI’s own docs now draw that line pretty explicitly: use GPT‑Realtime‑2 for assistant behavior, and use the translation endpoint when the app should behave like an interpreter instead. (openai.com)

### What is GPT‑Realtime‑2 for?

Basically, it is the flagship voice agent model. OpenAI says it brings “GPT‑5‑class reasoning” into realtime audio, with configurable reasoning effort so developers can trade latency for smarter answers. The pricing page also shows it keeping the same text-token price as gpt‑realtime‑1.5 ($4 per 1M input tokens and $24 per 1M output tokens) while charging separately for audio tokens. That tells you the pitch: better reasoning without forcing every app into a brand-new cost bracket. (developers.openai.com)

### What is different about the translation model?

Turns out this one is not just “the assistant, but in another language.” It runs on a different endpoint, `/v1/realtime/translations`, and OpenAI frames it as an interpreter rather than a conversational agent. The model translates speech from 70+ input languages into 13 output languages while trying to keep pace with the speaker. It is also priced by audio duration, not text tokens: $0.034 per minute. That is a pretty clear sign OpenAI expects telephony, meetings, and live multilingual support workflows. (developers.openai.com)

### What about GPT‑Realtime‑Whisper?

This is the “just give me the words” model. OpenAI says it is for realtime transcription with controllable latency: lower delay for earlier partial text, higher delay for better accuracy. That is useful if you are building captions, meeting notes, or call analytics, where the tradeoff is not intelligence versus latency but speed versus transcript quality. The company is basically productizing a knob developers already care about. (developers.openai.com)

### Is this replacing older audio APIs?

Not exactly. File transcription and standard speech APIs still exist, and the older Audio API FAQ remains live. But the new docs make the architecture clearer: bounded audio files go through transcription endpoints, while open-ended, live sessions go through the Realtime stack over WebRTC or WebSocket. So this is less a replacement than a cleanup of the product map. (developers.openai.com)

### Why does this matter now?

Because the bottleneck has shifted. The hard part is no longer just getting a model to hear speech. It is making the model behave correctly for the job in front of it (assistant, interpreter, or transcriber) without weird latency, wrong incentives, or extra glue code. OpenAI is now selling those behaviors as separate primitives. That makes the API more legible for developers, and it nudges voice AI closer to boring infrastructure, which, for enterprise adoption, is usually the point. (help.openai.com)

### Bottom line?

This launch is really about specialization. OpenAI did not just add a shinier voice model. It carved realtime audio into three production roles and priced them like tools developers can actually plan around. If that holds up in the wild, the winners are not just flashy demo apps. They are the companies quietly wiring translation, call handling, and transcription into everyday software. (openai.com)
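
To make the "plan around" point concrete, here is a rough budgeting sketch in Python that uses only the prices quoted in this article: $0.034 per audio minute for GPT‑Realtime‑Translate, and $4 / $24 per 1M text input / output tokens for GPT‑Realtime‑2. The function names and the example workloads are hypothetical. Audio-token charges for GPT‑Realtime‑2 are billed separately and are not quoted here, so they are left out; treat the result as a lower bound, not a quote.

```python
# Prices as quoted in the article; check the current pricing page before relying on them.
TRANSLATE_USD_PER_AUDIO_MINUTE = 0.034
REALTIME2_USD_PER_1M_TEXT_INPUT = 4.0
REALTIME2_USD_PER_1M_TEXT_OUTPUT = 24.0


def translate_cost(audio_minutes: float) -> float:
    """Estimated GPT-Realtime-Translate cost, billed by audio duration."""
    if audio_minutes < 0:
        raise ValueError("audio_minutes must be non-negative")
    return audio_minutes * TRANSLATE_USD_PER_AUDIO_MINUTE


def realtime2_text_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated GPT-Realtime-2 cost for TEXT tokens only.

    Audio tokens are charged separately at a rate not given in the article,
    so this deliberately underestimates the full bill.
    """
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    return (
        input_tokens * REALTIME2_USD_PER_1M_TEXT_INPUT
        + output_tokens * REALTIME2_USD_PER_1M_TEXT_OUTPUT
    ) / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: 1,000 interpreted calls of 6 minutes each.
    minutes = 1_000 * 6
    print(f"Translation, {minutes} min: ${translate_cost(minutes):.2f}")

    # Hypothetical workload: 2M text input tokens and 500k text output tokens.
    print(f"Realtime-2 text tokens: ${realtime2_text_cost(2_000_000, 500_000):.2f}")
```

Because translation is billed per minute, its cost scales with call volume and length alone, while the voice agent's bill depends on how much text (and audio) each conversation consumes.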

Shared from Scout - Be the smartest in the room.