OpenAI ships three realtime voice models
- OpenAI launched three API voice models on May 7: gpt-realtime-2, gpt-realtime-translate, and gpt-realtime-whisper, aimed at live assistants, interpreters, and transcription apps.
- The clearest detail is the language split: translation takes 70+ spoken input languages and renders them live into 13 output languages.
- This pushes OpenAI’s Realtime API from chatty demos toward deployable meeting, travel, support, and captioning products.

Voice AI is getting more specialized, and more usable. OpenAI just split its realtime stack into three separate models for three separate jobs: one for live voice agents, one for live translation, and one for streaming transcription. That sounds like a packaging change, but it is really a product change. The gap before this was simple: one model could do a lot, but developers still had to bend it into interpreter mode or caption mode. On May 7, OpenAI made those modes first-class.

### What actually shipped?

OpenAI’s new release has three models: gpt-realtime-2 for live spoken assistants, gpt-realtime-translate for speech-to-speech interpretation, and gpt-realtime-whisper for low-latency speech-to-text. The company positioned them as API building blocks, not consumer features, so this is for app makers, device makers, and teams building call tools, meeting tools, kiosks, or multilingual assistants.

### Why split them up?

Because “talk back,” “translate,” and “transcribe” are not the same problem. A voice assistant needs to hold context, decide what the user wants, maybe call tools, and answer naturally. A translator should not “help”; it should just interpret faithfully and keep pace. A transcription model has a narrower job still: turn live audio into text fast, with partial updates arriving before the speaker is done. OpenAI’s docs now reflect that separation pretty explicitly. (openai.com)

### What is gpt-realtime-2 for?

This is the assistant model, the one meant to listen, reason, and respond in real time. OpenAI describes it as its first voice model with GPT-5-class reasoning, and the model page says developers can tune reasoning effort, trading more thinking for more latency and token use. Basically, this is the “smart conversation” lane, where the model is supposed to do more than just echo or route speech. (developers.openai.com)

### What is different about the translation model?

gpt-realtime-translate runs on a different session type and acts as an interpreter rather than an assistant. That matters more than it sounds. In the older setup, developers often had to prompt a general realtime model very carefully so it would translate instead of answering. Now OpenAI has a dedicated translation endpoint, and it says the model can take speech from 70+ input languages and render it live into 13 output languages. (openai.com) Pricing is also different: it is billed by audio minute, not text tokens.

### And the Whisper model?

gpt-realtime-whisper is for streaming transcription while someone is still speaking. The useful detail here is controllable latency: developers can choose faster partial text or wait slightly longer for better transcript quality. That makes it a better fit for captions, meeting notes, and any interface where text needs to appear live but still be readable. (developers.openai.com)

### Why does this matter beyond demos?

Because realtime voice systems usually fail at the boring parts: delay, interruptions, wrong mode, and multilingual messiness. A dedicated interpreter model and a dedicated transcription model reduce some of that friction. The result is less “AI talking to itself” energy and more practical software: travel interpreters, multilingual customer support, live captions, and meeting products that do not need a human to clean everything up afterward. (developers.openai.com) That is the bet, anyway.

### What changed from last year?

Last year’s push was about making realtime voice production-ready at all. This week’s push is about specialization. OpenAI already had a general realtime model and the Realtime API in market, but now it is carving the stack into clearer jobs with clearer interfaces. That usually means a platform is maturing. First you prove the trick works. Then you stop making everyone use the same trick for everything.
(openai.com)

### Bottom line

The big story is not just that OpenAI shipped three new voice models. It is that voice AI is moving from one impressive general demo toward a set of narrower, more dependable tools, and that is usually when developers start building things people actually use. (openai.com)
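
For developers, the three-way split described in this piece can be pictured as a plain routing decision made before any audio is sent. The sketch below is only an illustration of that idea: the model names are the ones quoted in this article, while `VoiceJob`, `MODEL_FOR_JOB`, and `pick_model` are hypothetical names made up for the example and are not part of any OpenAI SDK. It makes no network calls.

```python
from enum import Enum


class VoiceJob(Enum):
    """The three jobs the article says the realtime stack is split into."""
    ASSISTANT = "assistant"    # listen, reason, respond
    TRANSLATE = "translate"    # speech-to-speech interpretation
    TRANSCRIBE = "transcribe"  # streaming speech-to-text


# Model names as quoted in the article. The mapping itself is an
# illustrative assumption, not an official API or configuration.
MODEL_FOR_JOB = {
    VoiceJob.ASSISTANT: "gpt-realtime-2",
    VoiceJob.TRANSLATE: "gpt-realtime-translate",
    VoiceJob.TRANSCRIBE: "gpt-realtime-whisper",
}


def pick_model(job: VoiceJob) -> str:
    """Return the model name the article associates with a job."""
    return MODEL_FOR_JOB[job]


if __name__ == "__main__":
    for job in VoiceJob:
        print(f"{job.value}: {pick_model(job)}")
```

The point of keeping the choice in one lookup is that an app picks its mode up front, rather than prompting one general model into behaving like an interpreter or a captioner.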