OpenAI Launches 'gpt-realtime-1.5' for Sub-Second Responses
OpenAI has made `gpt-realtime-1.5`, a model optimized for sub-second response times, available in its Realtime API. The release targets use cases requiring fast, interactive experiences, such as customer support chat, collaborative editing tools, and live gaming. The Realtime API is separate from OpenAI's standard API and is designed for applications where low latency is critical.
- The `gpt-realtime-1.5` model shows specific performance gains over its predecessor, including a 10.23% improvement in alphanumeric transcription accuracy, a 7% increase in instruction following, and a 5% boost on the Big Bench Audio reasoning benchmark.
- Pricing for the new model remains the same as the original `gpt-realtime` API, structured per million tokens: Text costs $4 for input and $16 for output, while audio is priced at $32 for input and $64 for output.
- Early adopters have reported significant performance improvements; for instance, the startup Genspark saw its phone call error rates cut in half, and Sendbird noted exceptional enhancements in the model's ability to handle conversational interruptions.
- For browser-based applications, OpenAI recommends using its Agents SDK for TypeScript, which connects to the model via WebRTC for more consistent performance compared to the server-side WebSocket connections.
- The underlying architecture avoids the traditional, higher-latency "daisy chain" of separate speech-to-text, LLM, and text-to-speech services by using a single native multimodal model where audio is the input and audio is the output.
- Key competitors in the low-latency voice AI space include Google's Gemini Live API, which offers a similar all-in-one model, and more specialized providers like Hume AI, which focuses on detecting emotion and vocal tone.
- While the Realtime API provides an integrated solution, developers have the flexibility to use it for its real-time speech recognition and then route the text output to a third-party text-to-speech (TTS) provider for more control over the final voice.
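
To make the per-million-token pricing concrete, here is a minimal sketch of a cost estimate using the four prices quoted in the list. The function name `estimate_cost`, the `(modality, direction)` key layout, and the token counts in the sample session are all assumptions made for illustration; none of them come from OpenAI's API or billing system.

```python
from decimal import Decimal

# Prices in USD per one million tokens, as quoted in the article.
PRICE_PER_MILLION = {
    ("text", "input"): Decimal("4"),
    ("text", "output"): Decimal("16"),
    ("audio", "input"): Decimal("32"),
    ("audio", "output"): Decimal("64"),
}


def estimate_cost(usage: dict[tuple[str, str], int]) -> Decimal:
    """Estimate USD cost from a mapping of (modality, direction) -> token count.

    Hypothetical helper: the key layout is an illustrative choice,
    not the shape of any real usage report.
    """
    total = Decimal("0")
    for key, tokens in usage.items():
        # cost = tokens * (price per million) / 1,000,000
        total += PRICE_PER_MILLION[key] * tokens / Decimal("1000000")
    return total


# Hypothetical voice session; the token counts are made up.
session_usage = {
    ("text", "input"): 2_000,     # e.g. system instructions
    ("audio", "input"): 10_000,   # caller speech
    ("audio", "output"): 15_000,  # model speech
}

print(f"Estimated cost: ${estimate_cost(session_usage)}")
```

In this made-up session, audio accounts for nearly all of the cost, which follows directly from audio tokens being priced 8x (input) and 4x (output) higher than text tokens.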