OpenAI scales voice AI to 900M
- OpenAI published a May 4 engineering post explaining how it rebuilt ChatGPT voice infrastructure for more than 900 million weekly active users.
- The key move was a split relay-plus-transceiver WebRTC design that keeps session ownership stable while cutting setup time and media latency.
- That matters because voice AI breaks on tiny delays, and OpenAI is now treating realtime speech as core infrastructure.
Voice AI sounds simple from the outside. You talk, the model talks back. But the whole experience falls apart if the network adds even a small delay: people interrupt each other, barge-in fails, and the conversation starts feeling robotic fast. That is the backdrop for OpenAI’s new engineering write-up from May 4, which lays out how it rebuilt the WebRTC stack behind ChatGPT voice and the Realtime API to serve more than 900 million weekly active users.

### Why is voice harder than text?

Text systems can hide latency. A chatbot can think for a second and still feel usable. Voice cannot. OpenAI’s team says the requirements are fast connection setup, low and stable media round-trip time, and low jitter and packet loss so turn-taking still feels natural. In other words, the product target is not just “works” but “feels like speech.”

### Why use WebRTC at all?

WebRTC is the standard stack browsers and mobile apps already use for realtime audio and video. It handles ugly but necessary things like NAT traversal, encryption, codec negotiation, echo cancellation, and jitter buffering. That matters because OpenAI does not want every client app to reinvent transport just to talk to a server side that plugs into model inference.

### What broke at OpenAI’s scale?

Three constraints started colliding. One-port-per-session media termination did not fit OpenAI’s infrastructure well. ICE and DTLS sessions are stateful, so they need stable ownership instead of bouncing around the fleet. And global routing had to keep first-hop latency low even as usage spread worldwide. In short, all of it had to operate at ChatGPT scale.

### So what did OpenAI change?

The company says it built a split relay-plus-transceiver architecture. The relay handles the edge-facing part of the session, while the transceiver handles media processing deeper in the system.

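To make the idea of stable ownership concrete, here is a minimal Python sketch of a relay that pins each session to one transceiver. It is not OpenAI's implementation: the write-up does not describe the routing algorithm, so rendezvous hashing is swapped in as one plausible way to pick an owner, and every name here (`Relay`, `owner_for`, `forward`, the `tx-*` node names, the `sess-1` session ID) is hypothetical.

```python
import hashlib
from dataclasses import dataclass, field


def _score(session_id: str, node: str) -> int:
    """Deterministic score for a (session, node) pair (rendezvous hashing)."""
    digest = hashlib.sha256(f"{session_id}|{node}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


@dataclass
class Relay:
    """Hypothetical edge relay: forwards packets, never owns media state."""
    transceivers: list[str]
    owners: dict[str, str] = field(default_factory=dict)

    def owner_for(self, session_id: str) -> str:
        # Assumption: once a session has an owner, it stays pinned there so
        # stateful ICE/DTLS context never moves mid-call.
        owner = self.owners.get(session_id)
        if owner is None:
            # The highest-scoring node wins; any relay with the same fleet
            # list computes the same answer without coordination.
            owner = max(self.transceivers, key=lambda n: _score(session_id, n))
            self.owners[session_id] = owner
        return owner

    def forward(self, session_id: str, packet: bytes) -> tuple[str, bytes]:
        # Return (destination transceiver, unchanged packet).
        return self.owner_for(session_id), packet


relay = Relay(["tx-a", "tx-b", "tx-c"])
first_owner, _ = relay.forward("sess-1", b"rtp-payload")
relay.transceivers.append("tx-d")  # fleet grows while the call is live
later_owner, _ = relay.forward("sess-1", b"rtp-payload")
print(first_owner == later_owner)  # the session stays on its original owner
```

The pinning table is the part that matters for voice: hashing alone would let a fleet change reassign a live session, while the table keeps an in-progress call where its state already lives.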
The split lets OpenAI preserve standard WebRTC behavior for clients without forcing every media session to terminate in the transceiver. It separates concerns so routing, ownership, and media handling can scale independently.

### Why does “stable ownership” matter so much?

Realtime sessions are stateful. OpenAI’s developer docs spell that out pretty clearly: a session has configuration, conversation items, responses, and event flows that persist over time, with sessions lasting up to 60 minutes. If packets or control state land on the wrong machine at the wrong moment, the conversation can break. That is why stable session placement is not an optimization. It is table stakes for spoken interaction.

### Is this just for ChatGPT voice?

No, and that is the bigger story. OpenAI explicitly ties the work to ChatGPT voice, developers using the Realtime API, and agents in interactive workflows. So this is really infrastructure for a broader product direction where models listen, speak, and call tools inside a single live session instead of stitching together separate speech-to-text, LLM, and text-to-speech steps.

### Why should product teams care?

Because this is the trade-off map for anyone building meeting assistants, phone agents, tutors, or in-app voice controls. The lesson is not “use OpenAI’s exact stack.” It is that low-latency voice becomes a systems problem before it becomes a model problem. Network topology, session ownership, and transport design all have to be right.

### Bottom line?

OpenAI is signaling that voice is no longer a demo feature sitting on top of AI. It is core infrastructure now, and once you are serving hundreds of millions of people, milliseconds start acting like product bugs.