OpenAI launches three realtime developer voice models, led by GPT‑Realtime‑2 and GPT‑Realtime‑Translate
- OpenAI on May 7 launched three developer voice models in its Realtime API: GPT‑Realtime‑2, GPT‑Realtime‑Translate, and GPT‑Realtime‑Whisper.
- The headline detail is GPT‑Realtime‑2 pricing: $32 per 1M audio input tokens and $64 per 1M audio output tokens, plus cached discounts.
- The bigger shift is voice becoming a programmable app layer, not just a chatbot feature, if latency and costs hold.

Voice AI has had a pretty obvious problem for a while. It could sound smooth, but the moment you asked it to think, translate, or keep up with a real conversation, the illusion often broke. OpenAI is trying to fix that with three new developer models released on May 7: GPT‑Realtime‑2 for live voice reasoning, GPT‑Realtime‑Translate for speech translation, and GPT‑Realtime‑Whisper for streaming transcription. The point is not just nicer demos. It’s to make speech usable as a real software interface.

### What actually launched?

OpenAI added three models to its Realtime API. GPT‑Realtime‑2 is the flagship, a speech-to-speech model built for live conversations that can reason through harder requests while the user is still talking. GPT‑Realtime‑Translate handles live spoken translation, and GPT‑Realtime‑Whisper is for low-latency transcription that turns speech into text as it arrives. The company framed this as voice moving beyond simple turn-taking into systems that can listen, think, and act during the conversation. (openai.com)

### Why is GPT‑Realtime‑2 the main one?

Because this is the model aimed at the hardest part of voice software: not hearing words, but doing something smart with them fast enough that the exchange still feels natural. OpenAI says GPT‑Realtime‑2 is its first voice model with “GPT‑5‑class reasoning,” and the developer d[…] thinking. That matters because voice apps break the second the pause gets too long. (openai.com)

### What are the numbers?

The pricing makes the launch feel real, not experimental. GPT‑Realtime‑2 is listed at $4 per 1M text input tokens, $24 per 1M text output tokens, $32 per 1M audio input tokens, and $64 per 1M audio output tokens. Cached input is much cheaper, at $0.40 per 1M for text and audio, which is a big […] context so live voice apps do not become absurdly expensive. (developers.openai.com)

### What’s different about the translation model?

GPT‑Realtime‑Translate is narrower, but that is the point.
Instead of being a general voice assistant, it is tuned for one job: taking live speech in 70+ input languages and rendering it into 13 output languages while keeping pace with the speaker. That makes it more like infra[…]r field work where waiting for a full sentence batch would feel too slow. (openai.com)

### And the Whisper model?

GPT‑Realtime‑Whisper is the transcription piece. It streams speech-to-text with controllable latency, so developers can choose faster partial captions or wait a bit longer for cleaner text. That sounds minor, but it is one of the load-bearing choices in voice products. Live captions, meeting[…]e useful without becoming a mess. (developers.openai.com)

### Why does this matter beyond OpenAI?

Because it pushes voice from feature to platform. OpenAI already had realtime voice tooling, but this release splits the stack into clearer jobs (reasoning, translation, and transcription) and gives developers more direct control over cost, latency, and architecture through WebRTC, WebSocket, […] one-size-fits-all demos. (platform.openai.com)

### What’s the catch?

The catch is the same as always: latency, reliability, and cost. A voice agent can be impressive in a demo and still fail in production if it pauses too long, mistranscribes accents, or burns money on every turn. OpenAI’s own docs hint at that by exposing latency controls and warning developers to test wi[…] only feels natural when the engineering disappears. (developers.openai.com)

### Bottom line

This launch is really about treating speech as a first-class computing interface. If GPT‑Realtime‑2 is good enough at live reasoning, and the translation and transcription models stay fast and cheap enough, startups can build voice products that do more than chat: they can onboard, search, book, guide, and hand off work in real time. (openai.com)
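
To put "cheap enough" in concrete terms, here is a rough back-of-the-envelope sketch in Python. The per-token prices are the GPT‑Realtime‑2 figures quoted in this article. Everything else is illustrative: the token counts are made up, the `SessionUsage` class and `estimate_cost_usd` helper are hypothetical names, and treating cached tokens as a separate bucket billed at the cached rate is an assumption, not a description of how OpenAI actually meters usage.

```python
from dataclasses import dataclass

# USD per 1M tokens, copied from the GPT-Realtime-2 figures quoted in this article.
PRICE_PER_MILLION = {
    "text_in": 4.00,
    "text_out": 24.00,
    "audio_in": 32.00,
    "audio_out": 64.00,
    "cached_in": 0.40,  # the article quotes one cached rate for text and audio
}


@dataclass
class SessionUsage:
    """Hypothetical token counts for one voice session (not real measurements)."""
    text_in: int = 0
    text_out: int = 0
    audio_in: int = 0
    audio_out: int = 0
    cached_in: int = 0  # assumption: cached tokens are counted apart from fresh input


def estimate_cost_usd(usage: SessionUsage) -> float:
    """Sum tokens * price / 1M over every token bucket."""
    total = 0.0
    for bucket, price in PRICE_PER_MILLION.items():
        total += getattr(usage, bucket) * price / 1_000_000
    return total


# Made-up session: 50k audio tokens in, 25k audio tokens out, 100k cached tokens.
example = SessionUsage(audio_in=50_000, audio_out=25_000, cached_in=100_000)
print(f"${estimate_cost_usd(example):.2f}")  # prints "$3.24"
```

With those invented counts, the session comes to about $3.24: $1.60 for audio in, $1.60 for audio out, and $0.04 for the cached tokens. Under these assumptions, audio output costs the same as twice as many input tokens, and cached context barely registers.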