Efficient Open Source TTS Model
A new open-source text-to-speech model named Kani-TTS-2 has been released. The 400M parameter model is notable for its efficiency, requiring only 3GB of VRAM to run and including support for voice cloning.
- The model utilizes a two-stage architecture, first using a 350M parameter LiquidAI LFM2 backbone to generate 'audio intent' as discrete tokens, and then using an NVIDIA NanoCodec to convert those tokens into a 22kHz waveform.
- Its zero-shot voice cloning capability works by generating speaker embeddings from a short reference audio clip, allowing it to replicate a voice without any fine-tuning.
- The English version was trained on 10,000 hours of high-quality speech data in just 6 hours using a cluster of 8 NVIDIA H100 GPUs.
- It achieves a Real-Time Factor (RTF) of approximately 0.2, enabling it to generate 10 seconds of audio in about 2 seconds on consumer-grade hardware.
- Developed by the team at nineninesix.ai, Kani-TTS-2 is released under an Apache 2.0 license, permitting commercial use and integration.
- The model is available on Hugging Face in both English and Portuguese versions, and the developers have also released the full pre-training framework to allow for training on new languages or accents.
- Current limitations include a degradation in performance on inputs longer than 40 seconds and a lack of a true streaming implementation, meaning the time to first audio is the total generation time for the input text.
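
To make the RTF and latency figures concrete, here is a minimal sketch of the arithmetic in Python. It is not part of the Kani-TTS-2 codebase: the function names are hypothetical, the RTF of 0.2 and the 40-second threshold are taken from the summary above, and splitting long inputs into shorter clips is only an assumed workaround, not something the developers describe.

```python
import math

RTF = 0.2                 # approximate real-time factor quoted above
MAX_CLIP_SECONDS = 40.0   # length beyond which quality reportedly degrades


def generation_time(audio_seconds: float, rtf: float = RTF) -> float:
    """Wall-clock seconds needed to synthesize `audio_seconds` of speech."""
    return audio_seconds * rtf


def time_to_first_audio(audio_seconds: float, rtf: float = RTF) -> float:
    """Without streaming, nothing is playable until the whole clip is done,
    so the time to first audio equals the full generation time."""
    return generation_time(audio_seconds, rtf)


def chunks_needed(audio_seconds: float,
                  max_clip_seconds: float = MAX_CLIP_SECONDS) -> int:
    """Hypothetical workaround: number of clips required if each clip is
    kept at or below `max_clip_seconds` of audio."""
    return max(1, math.ceil(audio_seconds / max_clip_seconds))


if __name__ == "__main__":
    for seconds in (10, 40, 120):
        print(
            f"{seconds:>4} s of audio: "
            f"~{time_to_first_audio(seconds):.1f} s until first audio, "
            f"{chunks_needed(seconds)} clip(s) at <= {MAX_CLIP_SECONDS:.0f} s"
        )
```

Under these assumptions, a 10-second clip takes about 2 seconds to produce, matching the figure above, and a listener waits that full time before hearing anything.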