Lightweight Open-Source TTS Model 'Kani-TTS-2' Released
A new open-source text-to-speech model, Kani-TTS-2, has been released. The 400M-parameter model supports voice cloning and requires only 3GB of VRAM to run. Its small footprint makes advanced voice synthesis viable for edge computing and other resource-constrained environments.
- Kani-TTS-2's architecture avoids traditional mel-spectrograms by treating audio as a language. It uses a two-stage process: a 350M-parameter LiquidAI LFM2 backbone first generates audio tokens, and an NVIDIA NanoCodec then synthesizes a 22kHz waveform from them.
- The model's zero-shot voice cloning does not require fine-tuning. Instead, it generates speaker embeddings from a short reference audio clip and applies that voice's characteristics to the synthesized text.
- Training was notably efficient: the English model was trained on 10,000 hours of high-quality speech data in just 6 hours on a cluster of 8 NVIDIA H100 GPUs.
- It achieves a Real-Time Factor (RTF) of 0.2, meaning it generates 10 seconds of audio in approximately 2 seconds on consumer-grade GPUs like an RTX 3060.
- The model is released by the team at nineninesix.ai under an Apache 2.0 license, which permits commercial use.
- Along with the model, the developers have open-sourced the entire pretraining framework, including tools for FSDP multi-GPU training and Flash Attention 2, so that users can train the TTS model on new languages or accents.
- Current limitations include performance degradation on audio inputs that exceed 40 seconds and potential biases in prosody or pronunciation inherited from the training data.
- The model is available on Hugging Face in both English (Kani-TTS-2-EN) and Portuguese (Kani-TTS-2-PT) versions.
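
The speed and training figures above reduce to simple arithmetic, which the short Python sketch below works through. It is only an illustration of how the reported numbers relate to each other: the function names are hypothetical and are not part of the Kani-TTS-2 codebase, and the GPU-hour figure assumes all 8 GPUs were busy for the full 6 hours.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of the audio produced.
    Values below 1.0 mean synthesis is faster than real time."""
    return synthesis_seconds / audio_seconds


def estimated_synthesis_time(audio_seconds: float, rtf: float) -> float:
    """Estimate how long it takes to synthesize a clip of the given length."""
    return audio_seconds * rtf


def gpu_hours(wall_clock_hours: float, num_gpus: int) -> float:
    """Total GPU-hours, assuming every GPU runs for the whole job."""
    return wall_clock_hours * num_gpus


# Reported figure: about 2 s of compute for 10 s of audio.
rtf = real_time_factor(synthesis_seconds=2.0, audio_seconds=10.0)

# At the same RTF, a 40 s clip (the reported length limit) would take about 8 s.
long_clip_seconds = estimated_synthesis_time(audio_seconds=40.0, rtf=rtf)

# Reported training run: 6 hours on 8 H100 GPUs, over 10,000 hours of speech.
training_gpu_hours = gpu_hours(wall_clock_hours=6.0, num_gpus=8)
speech_hours_per_gpu_hour = 10_000 / training_gpu_hours

print(f"RTF: {rtf:.2f}")
print(f"Estimated time for a 40 s clip: {long_clip_seconds:.1f} s")
print(f"Training cost: {training_gpu_hours:.0f} GPU-hours")
print(f"Speech processed per GPU-hour: {speech_hours_per_gpu_hour:.0f} h")
```

Actual synthesis time will vary with hardware, batch size, and text length, so the RTF-based estimate is a rough guide rather than a guarantee.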