GPT-Realtime-2 builds Twilio voice agent
- OpenAI’s GPT-Realtime-2 can now be wired into Twilio phone calls using OpenAI’s Twilio transport and Realtime session stack, according to May 2026 documentation.
- OpenAI said on May 7 GPT-Realtime-2 is its first voice model with “GPT-5-class reasoning,” while Twilio integration documentation remains marked beta.
- Developers can follow OpenAI’s Twilio transport docs and Twilio’s voice assistant tutorial to build and test phone-based agents.
OpenAI’s latest voice stack is now documented for phone-call use with Twilio, giving developers a clearer path from browser demos to live telephony systems. OpenAI on May 7 introduced GPT-Realtime-2, describing it as its first voice model with “GPT-5-class reasoning” and positioning it for harder spoken requests and tool use during live conversations.

Twilio and OpenAI documentation together show how that model can be connected to Twilio Voice Media Streams through a WebSocket server, with audio sent in both directions during a call. The setup turns a speech model into a phone agent that can listen, respond, call tools and manage interruptions while the call stays live.

### How does the phone call actually reach GPT-Realtime-2?

Twilio’s Media Streams API sends raw call audio to a developer-run WebSocket server, and OpenAI’s Twilio transport is built to bridge that stream into the Realtime API. OpenAI’s documentation says developers can connect the incoming Twilio WebSocket to a `TwilioRealtimeTransportLayer`, then attach that transport to a `RealtimeSession` that uses a `RealtimeAgent`. That session can then connect to the OpenAI Realtime API with an API key and expose the same behaviors available in other realtime sessions, including tool calls and guardrails. (openai.com)

Twilio’s August 2025 tutorial describes the same basic architecture in Python with FastAPI: a Twilio Media Stream server receives phone-call audio, forwards it to OpenAI’s Realtime API, and sends the model’s audio response back to Twilio for playback to the caller. That tutorial predates GPT-Realtime-2, but the call flow matches the transport pattern OpenAI now documents for the newer model. (openai.github.io)

### Why is this different from a browser voice demo?

OpenAI’s voice-agent documentation distinguishes browser-first WebRTC setups from server-side transports such as WebSocket, SIP and Twilio.
In the browser flow, OpenAI recommends short-lived ephemeral client tokens and WebRTC connections. In a Twilio flow, the call arrives through Twilio’s telephony system first, and the developer has to run a server that can accept Twilio’s WebSocket connection. (twilio.com)

OpenAI says phone calls introduce more latency than web-based conversations, which is why its Twilio transport handles audio forwarding and interruption timing for developers. The company’s transport guide says the dedicated Twilio adapter is the better default when a team wants the SDK to manage interruption behavior for Twilio Media Streams. (openai.github.io)

### Where do tool calls and event handling fit in?

OpenAI’s Voice Agents SDK wraps the underlying realtime event flow in `RealtimeAgent`, `RealtimeSession` and transport helpers, rather than removing that event model. The documentation says the SDK keeps the Realtime API mental model intact while making tools, guardrails, handoffs, tracing and session history easier to manage in spoken interfaces. For a phone agent, that means a developer can keep the call connected while the model invokes functions or other hosted tools in the background. (openai.github.io)

OpenAI’s Twilio page says any event and behavior expected from a `RealtimeSession` should work with the Twilio transport, including tool calls and guardrails. The same page says developers can inspect raw Twilio messages through transport events and enable SDK debug logging with an environment variable for troubleshooting.

### What does “barge-in” mean in this stack?

OpenAI’s Twilio transport documentation says the adapter handles interruption timing using Twilio’s mark events. (openai.github.io) In practice, that is the mechanism that lets a live caller cut off the assistant mid-response and start speaking again without waiting for the audio to finish.
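
To make the mark-event mechanism concrete, here is a minimal Python sketch of the bookkeeping a server could do if it handled barge-in by hand. It is not the OpenAI SDK's implementation. It assumes Twilio's documented Media Streams messages (`media`, `mark` and `clear`) and 8 kHz mu-law audio, and the `PlaybackTracker` class and `chunk-N` mark names are hypothetical.

```python
from __future__ import annotations

import base64
import json
from dataclasses import dataclass, field

# Assumption: Twilio's default call audio is 8 kHz, 8-bit mu-law, mono,
# which works out to 8 bytes per millisecond of audio.
MULAW_BYTES_PER_MS = 8


@dataclass
class PlaybackTracker:
    """Hypothetical helper: tracks how much assistant audio Twilio has played."""

    stream_sid: str
    pending: dict = field(default_factory=dict)  # mark name -> chunk length in ms
    played_ms: int = 0
    next_id: int = 0

    def queue_audio(self, mulaw_chunk: bytes) -> list[str]:
        """Build the media message for one chunk, followed by a mark message."""
        name = f"chunk-{self.next_id}"
        self.next_id += 1
        self.pending[name] = len(mulaw_chunk) // MULAW_BYTES_PER_MS
        media = {
            "event": "media",
            "streamSid": self.stream_sid,
            "media": {"payload": base64.b64encode(mulaw_chunk).decode("ascii")},
        }
        mark = {"event": "mark", "streamSid": self.stream_sid, "mark": {"name": name}}
        return [json.dumps(media), json.dumps(mark)]

    def on_twilio_message(self, raw: str) -> None:
        """Twilio echoes a mark once the audio sent before it has finished playing."""
        msg = json.loads(raw)
        if msg.get("event") == "mark":
            # Marks that are no longer pending (e.g. after a clear) count as 0 ms.
            self.played_ms += self.pending.pop(msg["mark"]["name"], 0)

    def interrupt(self) -> tuple[str, int]:
        """Caller spoke over the agent: return a clear message and the ms played."""
        self.pending.clear()
        clear = {"event": "clear", "streamSid": self.stream_sid}
        return json.dumps(clear), self.played_ms
```

In this sketch, the millisecond count returned by `interrupt()` is what a server would use to tell the model how much of its response the caller actually heard. The dedicated Twilio transport is described as handling this timing so developers do not have to write it themselves.
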
OpenAI also says the session lifecycle for voice agents includes interruptions and voice activity detection, both of which matter more on a phone line where callers talk over prompts and pauses are common. (openai.github.io)

The same docs warn that speed matters. OpenAI tells developers to create the Twilio transport as soon as they get the WebSocket reference and call `session.connect` immediately afterward so the system can receive the necessary Twilio events and audio from the start of the call.

### What are the practical constraints developers need to know?

OpenAI’s model page says GPT-Realtime-2 supports configurable reasoning effort and warns that higher reasoning effort can increase latency and output token use. (openai.github.io) For a phone agent, that creates a direct trade-off between deeper reasoning and faster turn-taking.

OpenAI’s Twilio integration page also labels the adapter beta and says developers may encounter edge-case issues or bugs. (openai.github.io) The setup requires a Twilio account, a Twilio phone number, and a WebSocket server reachable by Twilio; OpenAI’s docs suggest ngrok or Cloudflare Tunnel for local development, while Twilio’s tutorial lists similar prerequisites in Python.

May 2026 documentation from OpenAI points developers to the Twilio transport guide for phone calls and to the broader Voice Agents guides for tools, history and tracing. (developers.openai.com) Twilio’s published tutorial remains a working reference for the server side, while OpenAI’s current model page lists GPT-Realtime-2 pricing and configuration details for teams moving from a demo call to a deployed system. (openai.github.io)
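
For readers wondering how Twilio learns where the developer's WebSocket server lives, the sketch below builds the TwiML response a webhook could return for an incoming call. It assumes Twilio's `<Connect><Stream>` verbs and a `wss://` endpoint; the `build_stream_twiml` helper and the example tunnel hostname and path are hypothetical.

```python
from xml.etree.ElementTree import Element, SubElement, tostring


def build_stream_twiml(websocket_url: str) -> str:
    """Return TwiML that asks Twilio to open a Media Stream to websocket_url."""
    # Assumption: the stream URL must be a secure WebSocket (wss://) endpoint,
    # e.g. the public address of an ngrok or Cloudflare Tunnel during development.
    if not websocket_url.startswith("wss://"):
        raise ValueError("expected a wss:// URL for the media stream")
    response = Element("Response")
    connect = SubElement(response, "Connect")
    SubElement(connect, "Stream", {"url": websocket_url})
    return '<?xml version="1.0" encoding="UTF-8"?>' + tostring(response, encoding="unicode")


# Hypothetical tunnel address pointing at a local WebSocket server.
twiml = build_stream_twiml("wss://example.test/media-stream")
```

The webhook configured on the Twilio phone number would return this document, after which Twilio opens the WebSocket and starts streaming call audio to the server.
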