100ms local TTS on M4
Benchmarks show local text-to-speech on Apple Silicon M4 reaching ~100ms latency with zero marginal cost after setup, which makes a strong case for real-time, privacy-preserving voice AI on device. The post compares this favorably with cloud services such as ElevenLabs for low-latency pipelines (x.com).
Open-source Qwen3-TTS and several community forks claim sub-100ms first-packet streaming for their lightweight TTS models (reported figures ~97–100ms) on project pages and in READMEs (qwen3-tts.org, github.com/runvnc/qwen3tts).
A community test using mlx-audio on a Mac mini M4 measured generate_audio latency of about 1,049ms warm, with multi-second cold starts, showing that end-to-end pipeline overhead and warm/cold behavior can swamp raw model inference time.
Pocket-TTS, a 100M-parameter CPU-only TTS library, reported ~200ms first-chunk latency and roughly 6× real-time throughput on a MacBook Air M4, suggesting that small CPU models tend to land in the 100–300ms range rather than below 100ms (aibit.im).
A comparative blog test found Qwen3-TTS delivering ~97ms to first packet while ElevenLabs averaged ~200ms in the same runs, and estimated material subscription savings for heavy content generation after switching to local hosting.
Alternative inference runtimes and libraries are being positioned as latency levers: MetalRT claims multi-x speedups over Apple's MLX on STT/TTS workloads, and MLX-audio posts document streaming-optimized engineering patterns for Apple Silicon.
Deployment notes in community repos and GUIs repeatedly name the same tuning knobs for lowest latency: prefer the 0.6B model for <6GB VRAM and the 1.7B model for ≥12GB VRAM, pre-load models to avoid 3–5s cold starts, use hybrid streaming generators for sub-100ms first packets, and remove file and HTTP I/O from the hot path.
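Because the figures above mix first-packet latency, warm versus cold runs, and real-time throughput, a small harness helps keep the three apart when reproducing them. The sketch below is a minimal, hedged example: `fake_tts_stream` and `LazyEngine` are hypothetical stand-ins rather than any real library's API, and the 24 kHz, 16-bit mono PCM output format is an assumption. To measure a real engine, pass any callable that takes text and yields raw PCM chunks.

```python
import time
from typing import Callable, Dict, Iterator, Optional

SAMPLE_RATE = 24_000   # assumed output sample rate in Hz (not from the source)
BYTES_PER_SAMPLE = 2   # assumed 16-bit mono PCM


def fake_tts_stream(text: str, chunk_ms: int = 80,
                    compute_s: float = 0.005) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS engine.

    Yields one chunk of silent PCM per word after a short simulated delay.
    """
    samples_per_chunk = SAMPLE_RATE * chunk_ms // 1000
    for _ in text.split():
        time.sleep(compute_s)  # simulated per-chunk inference time
        yield b"\x00" * (BYTES_PER_SAMPLE * samples_per_chunk)


class LazyEngine:
    """Hypothetical engine that loads its 'model' on first use (cold start)."""

    def __init__(self, load_s: float = 0.05) -> None:
        self._loaded = False
        self._load_s = load_s

    def __call__(self, text: str) -> Iterator[bytes]:
        if not self._loaded:
            time.sleep(self._load_s)  # simulated model load
            self._loaded = True
        yield from fake_tts_stream(text)


def measure_stream(synthesize: Callable[[str], Iterator[bytes]],
                   text: str) -> Dict[str, Optional[float]]:
    """Time a streaming synthesis call from request to last chunk."""
    start = time.perf_counter()
    first_chunk_s: Optional[float] = None
    total_bytes = 0
    for chunk in synthesize(text):
        if first_chunk_s is None:
            first_chunk_s = time.perf_counter() - start
        total_bytes += len(chunk)
    wall_s = time.perf_counter() - start
    audio_s = total_bytes / (BYTES_PER_SAMPLE * SAMPLE_RATE)
    return {
        "first_chunk_ms": None if first_chunk_s is None else first_chunk_s * 1000,
        "audio_s": audio_s,
        "wall_s": wall_s,
        # Seconds of audio produced per second of wall time (>1 is faster than real time).
        "speed_vs_realtime": audio_s / wall_s if wall_s > 0 else float("inf"),
    }


if __name__ == "__main__":
    engine = LazyEngine()
    for label in ("cold", "warm"):
        stats = measure_stream(engine, "hello there world")
        print(label, stats)
```

Running the same engine twice separates the cold-start cost from steady-state latency, which is the distinction the mlx-audio report highlights. Keeping file writes and HTTP calls outside `measure_stream` keeps I/O out of the measured hot path.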