AssemblyAI launches voice agent API
- AssemblyAI launched a Voice Agent API on April 29 that bundles speech recognition, reasoning, and speech output into one real-time WebSocket endpoint.
- The headline number is $4.50 per hour all-in, with built-in tool calling, interruption handling, turn detection, and streamed PCM16 audio replies.
- It shifts AssemblyAI from speech components toward full agent infrastructure for teams that want faster deployment, not custom orchestration. (assemblyai.com)
Voice agents are the next layer up from speech-to-text APIs. They do the whole loop: hear you, decide what you meant, maybe call a tool, then talk back. The problem is that most teams still have to glue that stack together themselves. On April 29, AssemblyAI said it wants to remove that plumbing step by launching a Voice Agent API that wraps the whole thing behind one WebSocket connection. (assemblyai.com)

The Voice Agent API is a real-time voice conversation endpoint. Audio goes in, audio comes back out, and the service handles speech recognition, response generation, and voice synthesis in the middle. AssemblyAI is pitching it as a native voice-agent product, not just another transcription API with extra steps. (assemblyai.com)

### Why is one WebSocket a big deal?

Teams typically wire together separate speech-to-text, LLM, and text-to-speech providers, then add turn detection, interruption logic, and tool wiring on top. AssemblyAI’s pitch is basically: stop orchestrating three or four systems and just connect once. (assemblyai.com)

### What does the API include?

The feature list includes configurable turn detection, barge-in so users can interrupt the agent mid-response, native audio output, and tool calling through structured events. The quickstart example shows tools like weather and time being registered and answered inside the same session flow. (assemblyai.com)

The price is $4.50 per hour, billed as one all-in rate. That covers speech understanding, LLM reasoning, and voice generation. The pricing matters because it turns a multi-vendor architecture problem into a single usage line item, at least for teams willing to trade some stack control for speed. (assemblyai.com)

### What models is this built on?

The API uses Universal-3 Pro Streaming as its listening layer. That model is already AssemblyAI’s higher-end real-time speech product, and the company keeps framing the new API around one argument: voice agents fail when they mishear users or cut them off at the wrong time. In other words, the “ears” matter as much as the language model. (assemblyai.com)
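To make "tool calling through structured events" a little more concrete, here is a minimal Python sketch of how a client might answer tool requests that arrive as JSON messages in a session. It is not AssemblyAI's actual protocol: the event types `tool.call` and `tool.result`, the fields `call_id`, `name`, and `arguments`, and both tool functions are hypothetical names chosen for illustration, and the network layer is left out entirely.

```python
import json
from datetime import datetime, timezone
from typing import Callable, Dict, Optional


# Hypothetical tools, echoing the weather/time examples mentioned in the article.
def get_weather(city: str) -> dict:
    # Canned placeholder data; a real tool would query a weather service.
    return {"city": city, "forecast": "sunny", "temp_c": 21}


def get_time() -> dict:
    # Current UTC time as an ISO 8601 string.
    return {"iso": datetime.now(timezone.utc).isoformat()}


# Registry mapping tool names to the functions that implement them.
TOOLS: Dict[str, Callable[..., dict]] = {
    "get_weather": get_weather,
    "get_time": get_time,
}


def handle_event(raw: str) -> Optional[str]:
    """Return a JSON reply for a tool-call event, or None for anything else.

    The "tool.call" / "tool.result" event shapes are assumptions made for
    this sketch, not a documented message format.
    """
    event = json.loads(raw)
    if event.get("type") != "tool.call":
        return None  # audio, transcript, and other events are not handled here

    tool = TOOLS.get(event["name"])
    if tool is None:
        result = {"error": "unknown tool: " + event["name"]}
    else:
        result = tool(**event.get("arguments", {}))

    # Echo the call id so the service can match the result to its request.
    return json.dumps(
        {"type": "tool.result", "call_id": event["call_id"], "result": result}
    )
```

In a real client, `handle_event` would sit inside the WebSocket receive loop, and any non-`None` reply would be sent back over the same connection. That is the practical sense in which tools get answered "inside the same session flow."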
AssemblyAI’s comparison page makes the split pretty explicit. If you already have your own LLM and TTS stack, the company still wants you on Universal-3 Pro Streaming alone. But if you want the fastest path to a working voice agent, the new API is the shortcut product: one connection, fewer moving parts, and less infrastructure to maintain. (assemblyai.com)

The launch moves AssemblyAI further up the stack. The company built its name on speech recognition and speech understanding APIs. A full voice-agent endpoint means it is no longer just selling the listening layer; it is selling a packaged conversational system. That puts it closer to the platform layer where developers choose between control and convenience. (assemblyai.com)

A bundled API is great when you want to ship fast, but less great if you want to swap every component, tune each model separately, or optimize costs across providers. AssemblyAI is openly positioning the Voice Agent API as the simpler option, while keeping standalone streaming speech for teams that want deeper control. (assemblyai.com)

AssemblyAI is turning its speech stack into a full voice-agent product. For developers building phone bots, support agents, or live assistants, that can cut weeks of integration work down to one API connection. (assemblyai.com)
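For teams weighing the flat rate against a multi-vendor stack, a back-of-the-envelope estimate helps put $4.50 per hour in context. The sketch below assumes usage is billed in proportion to conversation time. The article does not state the billing granularity, so treat that as an assumption. The call volumes are made-up inputs, not reported figures.

```python
RATE_USD_PER_HOUR = 4.50  # all-in rate quoted in the article


def session_cost(minutes: float, rate_per_hour: float = RATE_USD_PER_HOUR) -> float:
    """Estimated cost in USD for one conversation of the given length.

    Assumes cost scales linearly with duration (billing granularity is
    an assumption, not something the article specifies).
    """
    return round(minutes / 60 * rate_per_hour, 2)


def monthly_cost(calls_per_day: int, avg_minutes: float, days: int = 30) -> float:
    """Estimated monthly spend in USD for a steady daily call volume."""
    total_hours = calls_per_day * avg_minutes * days / 60
    return round(total_hours * RATE_USD_PER_HOUR, 2)


# Hypothetical workload: 500 calls a day averaging 4 minutes each.
print(session_cost(4))       # 0.3
print(monthly_cost(500, 4))  # 4500.0
```

Under those assumed inputs, a four-minute call costs about 30 cents and the month comes to 1,000 billed hours. That single figure is what a team would compare against the combined speech-to-text, LLM, and text-to-speech bills of a self-assembled stack.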