Efficient Open Source TTS Model

A new open-source text-to-speech model named Kani-TTS-2 has been released. The 400M-parameter model is notable for its efficiency: it needs only 3 GB of VRAM to run and supports voice cloning.

- The model utilizes a two-stage architecture, first using a 350M parameter LiquidAI LFM2 backbone to generate 'audio intent' as discrete tokens, and then using an NVIDIA NanoCodec to convert those tokens into a 22kHz waveform.
- Its zero-shot voice cloning capability works by generating speaker embeddings from a short reference audio clip, allowing it to replicate a voice without any fine-tuning.
- The English version was trained on 10,000 hours of high-quality speech data in just 6 hours using a cluster of 8 NVIDIA H100 GPUs.
- It achieves a Real-Time Factor (RTF) of approximately 0.2, enabling it to generate 10 seconds of audio in about 2 seconds on consumer-grade hardware.
- Developed by the team at nineninesix.ai, Kani-TTS-2 is released under an Apache 2.0 license, permitting commercial use and integration.
- The model is available on Hugging Face in both English and Portuguese versions, and the developers have also released the full pre-training framework to allow for training on new languages or accents.
- Current limitations include a degradation in performance on inputs longer than 40 seconds and a lack of a true streaming implementation, meaning the time to first audio is the total generation time for the input text.
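
The Real-Time Factor figure can be checked with simple arithmetic: RTF is generation time divided by the duration of the audio produced, so an RTF of 0.2 means a 10-second clip takes roughly 2 seconds to synthesize. The sketch below is illustrative only. The function names are hypothetical and not part of the Kani-TTS-2 code, and the default of 0.2 is the approximate figure quoted above, not a measured value. Because there is no streaming implementation, the wait before any audio plays is assumed to equal the full generation time.

```python
def generation_time_s(audio_duration_s: float, rtf: float = 0.2) -> float:
    """Estimate wall-clock seconds needed to synthesize a clip.

    RTF = generation time / audio duration, so
    generation time = audio duration * RTF.
    The default rtf=0.2 is the approximate figure quoted in the summary.
    """
    if audio_duration_s < 0:
        raise ValueError("audio_duration_s must be non-negative")
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return audio_duration_s * rtf


def time_to_first_audio_s(audio_duration_s: float, rtf: float = 0.2) -> float:
    """Estimate the wait before playback can start.

    Assumption: without streaming, no audio is available until the
    whole clip has been generated, so this equals the total generation time.
    """
    return generation_time_s(audio_duration_s, rtf)


if __name__ == "__main__":
    for seconds in (10.0, 40.0):
        wait = time_to_first_audio_s(seconds)
        print(f"{seconds:.0f} s of audio: about {wait:.1f} s before playback")
```

Under these assumptions, an input at the 40-second limit mentioned above would take about 8 seconds before any audio is heard.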
