Rapid voice agents via STT–LLM–TTS loops
dTelecom demonstrated rapid prototyping of voice agents using STT→LLM→TTS loops over real-time infrastructure, sidestepping custom media stacks and showing how composable pipelines speed up voice agent development. The pattern shortens the path from proof of concept to usable voice features by reusing existing real-time components.
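As a rough illustration of the composable pattern described above, the sketch below wires three swappable stages into a single conversational turn. It is not dTelecom's code: the class name `VoicePipeline`, the method `handle_turn`, and the stand-in stage functions are all hypothetical, and a real agent would stream audio rather than process whole buffers.

```python
from dataclasses import dataclass
from typing import Callable

# Each stage is a plain callable, so an engine can be swapped
# without touching the loop itself. All names here are hypothetical.
SpeechToText = Callable[[bytes], str]
LanguageModel = Callable[[str], str]
TextToSpeech = Callable[[str], bytes]


@dataclass
class VoicePipeline:
    stt: SpeechToText
    llm: LanguageModel
    tts: TextToSpeech

    def handle_turn(self, audio_in: bytes) -> bytes:
        """Run one STT -> LLM -> TTS pass for a single user utterance."""
        transcript = self.stt(audio_in)
        reply_text = self.llm(transcript)
        return self.tts(reply_text)


# Stand-in stages so the sketch runs without any external service.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = VoicePipeline(stt=fake_stt, llm=fake_llm, tts=fake_tts)
reply_audio = pipeline.handle_turn(b"hello")
print(reply_audio)
```

Keeping the stages behind simple callables is one way to get the reuse the brief describes: the real-time transport delivers audio buffers, and each engine can be replaced independently.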
dTelecom's agents-js framework on GitHub documents code that lets voice agents join WebRTC rooms and orchestrate STT→LLM→TTS flows inside conference sessions. (github.com)
dTelecom's STT offering advertises a dual-engine stack: Parakeet-TDT (advertised as 3–4× faster and focused on ~25 European languages) with a Whisper fallback (99+ languages) and smart routing for failover. (x402stt.dtelecom.org)
The project is built as a Solana DePIN service layer for real-time communications. It positions itself as decentralized infrastructure for voice, video, and chat, and cites Solana DePIN growth metrics from mid-2025. (dtelecom.org)
Industry latency baselines put STT at ~100–500 ms, LLM inference at typically 350 ms to 1 s or more, and TTS at ~75–200 ms, against a human conversational target window of ~300–500 ms. Summed sequentially, even the low ends of those ranges land above the target, which explains why end-to-end stacks often exceed human turn-taking expectations. (introl.com)
Co-location and composable pipeline patterns are being used to cut network hops. Together AI demonstrated a co-located STT+LLM+TTS deployment and claims end-to-end latency under 700 ms, while LiveKit and Gladia docs recommend early turn detection and concurrent STT/LLM/TTS workers to shave hundreds of milliseconds. (together.ai)
An academic telecom prototype (arXiv, Aug 5, 2025) showed a production-oriented stack that combines streaming ASR, a 4-bit quantized LLM, and retrieval-augmented generation (RAG) to serve low-latency, knowledge-grounded voice use cases. That architecture maps directly onto dTelecom's approach of reusing real-time components. (arxiv.org)
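The latency figures quoted above can be checked with simple arithmetic. The sketch below sums the per-stage ranges under the assumption that the stages run strictly one after another with no overlap or network overhead; the variable and function names are illustrative, and the 1000 ms LLM upper bound is a floor on the quoted "1 s+" rather than a true maximum.

```python
# Per-stage latency ranges in milliseconds, taken from the baselines above.
# The LLM upper bound is quoted as "1 s+", so 1000 is only a lower estimate.
STAGE_LATENCY_MS = {
    "stt": (100, 500),
    "llm": (350, 1000),
    "tts": (75, 200),
}

# Human conversational target window, in milliseconds.
TARGET_WINDOW_MS = (300, 500)


def sequential_budget(stages: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Sum best-case and worst-case latencies, assuming no stage overlap."""
    best = sum(low for low, _ in stages.values())
    worst = sum(high for _, high in stages.values())
    return best, worst


best_ms, worst_ms = sequential_budget(STAGE_LATENCY_MS)
overshoot_ms = best_ms - TARGET_WINDOW_MS[1]

print(f"best case:  {best_ms} ms")
print(f"worst case: {worst_ms} ms (at least)")
print(f"best case exceeds the target ceiling by {overshoot_ms} ms")
```

Under these assumptions the best case is 525 ms, already 25 ms over the 500 ms ceiling, and the worst case is at least 1700 ms. That gap is why the techniques mentioned above (co-location, early turn detection, concurrent workers) aim to overlap stages rather than only speed up each one.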