Ant Group Demos Unified Audio Generation Model
Ant Group has demonstrated a unified audio generation model that can produce speech, music, and sound effects together. The technology offers precise vocal control over pitch and emotion via text prompts and can generate over 100 voices with zero-shot design. Its efficient "Patch-by-Patch" compression strategy makes it a potentially powerful tool for agency audio workflows.
- The model is a component of Ant Group's broader Ming-Flash-Omni 2.0, which is an open-source, multi-modal model. This larger framework is built on a Mixture-of-Experts (MoE) architecture with 100 billion total parameters, of which 6 billion are active. The project is spearheaded by Zhou Jun, the head of Ant Group's Ling model team.
- The "Patch-by-Patch" compression strategy is a key innovation for efficiency, reducing the model's inference frame rate to an ultra-low 3.1Hz. This allows for the real-time, high-fidelity generation of audio that is minutes long, a significant advantage for production workflows.
- Its zero-shot voice cloning allows for the creation of new voices with minimal data input, unlike traditional methods that require extensive recordings of a target speaker. This is achieved by leveraging a pre-trained model that understands general voice characteristics and can adapt them to a new voice from a small sample.
- The unified model is capable of generating speech, ambient sound effects, and music within a single audio track, providing a more integrated and seamless auditory experience. This is a departure from workflows that rely on separate, specialized models for each audio type.
- On specific industry benchmarks, the model has demonstrated strong performance. For instance, its Cantonese dialect control achieves 93% accuracy, and its emotional expressiveness on certain test sets surpasses that of competitors like CosyVoice3. In some multi-modal benchmark tests, the broader Ming-Flash-Omni 2.0 model has even surpassed Google's Gemini 2.5 Pro.
- The model's code and weights are publicly available on open-source platforms like Hugging Face, allowing creative and development teams to directly integrate and build upon the technology.
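
To see why a 3.1Hz frame rate matters for minutes-long audio, a rough back-of-the-envelope calculation helps. The sketch below assumes one generation step per frame, which the brief does not confirm, and compares against a 25Hz baseline chosen purely for illustration (it is not a figure from Ant Group). The function names are hypothetical and are not part of any Ming-Flash-Omni API.

```python
import math

# Frame rate reported in the brief for the "Patch-by-Patch" strategy.
REPORTED_FRAME_RATE_HZ = 3.1
# ASSUMPTION: an illustrative higher-rate baseline, not taken from the source.
ASSUMED_BASELINE_HZ = 25.0


def frames_needed(duration_seconds: float, frame_rate_hz: float) -> int:
    """Frames required to cover a clip, assuming one generation step per frame."""
    # Round first so floating-point noise cannot push an exact product upward.
    return math.ceil(round(duration_seconds * frame_rate_hz, 6))


def step_reduction(duration_seconds: float, low_hz: float, high_hz: float) -> float:
    """How many times fewer frames the lower frame rate needs for the same clip."""
    return frames_needed(duration_seconds, high_hz) / frames_needed(duration_seconds, low_hz)


if __name__ == "__main__":
    for minutes in (1, 3, 5):
        seconds = minutes * 60
        low = frames_needed(seconds, REPORTED_FRAME_RATE_HZ)
        high = frames_needed(seconds, ASSUMED_BASELINE_HZ)
        print(f"{minutes} min: {low} frames at 3.1Hz vs {high} frames at 25Hz")
```

Under these assumptions, a three-minute track needs 558 frames at 3.1Hz versus 4,500 at the illustrative 25Hz baseline, roughly an eightfold reduction in sequence length.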