OpenAI ships realtime voice models
- OpenAI on May 7 launched three realtime API voice models (GPT‑Realtime‑2, GPT‑Realtime‑Translate, and GPT‑Realtime‑Whisper) for live speech agents, translation, and transcription.
- The flagship model adds configurable reasoning in speech-to-speech mode, while the translation model handles 70+ input languages into 13 output languages.
- The release lands as OpenAI pushes broader consumer voice tools but keeps stronger cyber models behind its vetted Trusted Access program.

Voice AI has had a weird problem for a while. It could sound natural, or it could think carefully, or it could transcribe quickly, but usually not all three at once. OpenAI’s May 7 release is basically an attempt to collapse those tradeoffs into one stack. It shipped three new realtime audio models in the API, aimed at developers building apps that need to listen, respond, translate, and keep up with a live conversation. (openai.com)

### What actually shipped?

OpenAI added GPT‑Realtime‑2 for live speech-to-speech interaction, GPT‑Realtime‑Translate for streaming translation, and GPT‑Realtime‑Whisper for streaming speech-to-text. They plug into the Realtime API and related endpoints, so this is a developer release first, not just a ChatGPT feature refresh. (openai.com)

GPT‑Realtime‑2 is trying to do the hard version of voice: not just talk fast, but reason while talking. OpenAI says GPT‑Realtime‑2 supports configurable reasoning for speech-to-speech agents, which means developers can tune how much deliberation the model uses before responding. That matters for customer support, tutoring, assistants, and other voice apps where a glib answer is worse than a slightly slower but smarter one. (openai.com)

### What’s different about the translation model?

Most voice translation systems feel like a relay race: one model hears you, another turns speech into text, another translates, and another speaks back. GPT‑Realtime‑Translate is meant to feel more like an interpreter sitting in the room. OpenAI says it can take speech from 70+ input languages and render it into 13 output languages in realtime, whi[...]ffline captioning. (openai.com)

### And what is Realtime Whisper for?

That one is the cleanest to understand. It is for streaming transcription sessions, cases where an app needs text from audio but does not need the model to answer back. Think call transcripts, meeting notes, live captions, and voice logging. OpenAI’s docs split that workflow from full voice-agent sessions, which suggests the company wants developers to choose a lighter tool when they only need recognition. (openai.com)

### Is this the same thing as ChatGPT voice mode?

Not exactly. The release is in the API, so it is for builders. But it lines up with a broader OpenAI shift toward mixing faster everyday models with deeper reasoning when needed. In ChatGPT, “Instant” can automatically route harder requests to stronger reasoning models, and OpenAI recently updated GPT‑5.5 Instant as its default daily-driver model. [...] (openai.com)

### Why does the cyber model matter here?

Because it shows the split in OpenAI’s rollout strategy. On one side, the company is widening access to realtime consumer and developer voice tools. On the other, it is keeping higher-risk cybersecurity capability behind Trusted Access for Cyber, a vetted program for defenders and security teams. OpenAI said this week that GPT‑5.5’s cyber capabilities are being delivered through that screened access path, not opened broadly. (openai.com)

### So what changed for developers today?

The practical change is that developers now get a more segmented menu. One model for conversational voice with reasoning. One for live translation. One for pure streaming transcription. That sounds incremental, but it is how platforms mature: less one-size-fits-all, more purpose-built tools that map to real product jobs. (openai.com)

[...]ng voice less like a demo and more like infrastructure. The new models suggest the company thinks realtime speech apps are moving from novelty into product category, but with a clear line between mass-market voice features and tightly gated high-risk AI. (openai.com)
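
For builders, the segmented menu described above amounts to a small routing decision: which of the three models fits the job. The sketch below is only an illustration of that decision, not OpenAI's API. The model names are the display names used in this article, written with plain hyphens, and the identifiers the API actually accepts may differ. `VoiceJob`, `pick_job`, and `pick_model` are hypothetical helpers, and giving translation priority over conversation is an assumption made for the example.

```python
from enum import Enum


class VoiceJob(Enum):
    """The three product jobs the article maps to separate models."""
    CONVERSATION = "conversation"    # spoken back-and-forth with reasoning
    TRANSLATION = "translation"      # live speech translation
    TRANSCRIPTION = "transcription"  # speech in, text out, no spoken reply


# Display names taken from the article, written with plain hyphens.
# Assumption: the identifiers the API actually accepts may be spelled differently.
MODEL_FOR_JOB = {
    VoiceJob.CONVERSATION: "GPT-Realtime-2",
    VoiceJob.TRANSLATION: "GPT-Realtime-Translate",
    VoiceJob.TRANSCRIPTION: "GPT-Realtime-Whisper",
}


def pick_job(needs_spoken_reply: bool, needs_translation: bool) -> VoiceJob:
    """Hypothetical helper: choose the lightest job that covers the app's needs."""
    if needs_translation:
        # Assumption for this sketch: translation takes priority when requested.
        return VoiceJob.TRANSLATION
    if needs_spoken_reply:
        return VoiceJob.CONVERSATION
    # Recognition only: no spoken answer and no translation required.
    return VoiceJob.TRANSCRIPTION


def pick_model(needs_spoken_reply: bool, needs_translation: bool) -> str:
    """Return the article's display name for the model that fits the job."""
    return MODEL_FOR_JOB[pick_job(needs_spoken_reply, needs_translation)]


if __name__ == "__main__":
    # A live-captions app only needs text, so it lands on the transcription model.
    print(pick_model(needs_spoken_reply=False, needs_translation=False))
```

A meeting-notes tool would call `pick_model(False, False)` and land on the transcription model, while a support agent that talks back would call `pick_model(True, False)` and get the reasoning-capable one.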