Open-source VoxCPM2 voice model

A new open-source voice-cloning model called VoxCPM2 claims studio-quality output across 30 languages and runs locally on consumer GPUs, despite having only 2 billion parameters. (x.com) The thread argues that it outperforms some paid services such as ElevenLabs and positions it as a locally runnable alternative for multi‑language voice generation. (x.com)

A voice model called VoxCPM2 was released in April 2026 with open weights, 30-language speech generation, and local inference on consumer graphics cards. (github.com) Speech generation models turn text into audio, and voice-cloning systems try to copy a speaker’s tone, rhythm, and accent from a short sample. VoxCPM2’s developers say this version has 2 billion parameters, was trained on more than 2 million hours of multilingual speech, and outputs 48 kilohertz audio. (voxcpm.readthedocs.io)

OpenBMB’s GitHub repository says VoxCPM2 can generate speech in 30 languages without a language tag, create a new voice from a text description, and clone a voice from a short reference clip. The project page also says it is built on a MiniCPM-4 backbone and released under the Apache 2.0 license for commercial use. (github.com)

The Hugging Face model card says the package was published on April 7, 2026, and the Python package `voxcpm` was released on PyPI on April 8, 2026. The listed software requirements are Python 3.10 or newer, PyTorch 2.5.0 or newer, and CUDA 12.0 or newer. (huggingface.co, pypi.org)

The local-run claim rests on speed numbers rather than a laptop demo. OpenBMB says real-time factor is about 0.3 on an NVIDIA RTX 4090 and about 0.13 with Nano-VLLM acceleration, which means the model can generate audio faster than playback on that hardware. (github.com)

The model’s “tokenizer-free” design refers to how it handles audio internally. Instead of first compressing speech into a small vocabulary of audio tokens, the documentation says VoxCPM2 generates continuous speech representations through a four-stage pipeline that includes a local encoder, language model components, and a diffusion decoder. (openbmb.github.io, voxcpm.readthedocs.io)

That matters because many recent text-to-speech systems have split between closed voice application programming interfaces and open models that are cheaper to run but narrower in language support or quality. VoxCPM2’s release combines open weights, commercial licensing, multilingual output, and cloning controls in one package. (huggingface.co, github.com)

OpenBMB’s docs position VoxCPM2 as the default replacement for the VoxCPM 1 and 1.5 series, which previously topped out at 44.1 kilohertz output in the comparison table. The same docs say VoxCPM2 expands the training mix to 2.36 million hours, including 560,000 hours of added multilingual data. (voxcpm.readthedocs.io)

The strongest performance claims now circulating online are still mostly anecdotal. OpenBMB provides a public demo page with samples in English, Chinese, Hindi, Spanish, Arabic, French, Portuguese, Russian, Indonesian, Swahili, and German, but it does not publish a benchmark on that page comparing VoxCPM2 against ElevenLabs or other paid services. (openbmb.github.io, github.com)

What is verified today is narrower and concrete: VoxCPM2’s code, weights, docs, and demos are public, and the maintainers are pitching a 2 billion-parameter model as a locally runnable multilingual voice stack rather than a hosted-only product. (github.com, huggingface.co)
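
For readers unfamiliar with the real-time factor (RTF) figures quoted above, the arithmetic can be sketched in a few lines of Python. RTF is conventionally the time spent generating divided by the duration of the audio produced, so a value below 1 means synthesis finishes faster than playback. The function names below are illustrative and not part of the `voxcpm` package, and the two RTF values are simply the figures OpenBMB reports, not independent measurements.

```python
def synthesis_seconds(audio_seconds: float, rtf: float) -> float:
    """Estimated wall-clock time to generate `audio_seconds` of speech.

    Assumes the conventional definition RTF = generation time / audio duration.
    """
    if audio_seconds < 0 or rtf <= 0:
        raise ValueError("audio_seconds must be >= 0 and rtf must be > 0")
    return audio_seconds * rtf


def speedup_over_playback(rtf: float) -> float:
    """How many times faster than playback the model generates audio."""
    if rtf <= 0:
        raise ValueError("rtf must be > 0")
    return 1.0 / rtf


# RTF figures as reported by OpenBMB for an NVIDIA RTX 4090 (not re-measured here).
REPORTED_RTF = {
    "RTX 4090": 0.3,
    "RTX 4090 + Nano-VLLM": 0.13,
}

if __name__ == "__main__":
    clip_seconds = 60.0  # hypothetical one-minute clip
    for label, rtf in REPORTED_RTF.items():
        seconds = synthesis_seconds(clip_seconds, rtf)
        speedup = speedup_over_playback(rtf)
        print(f"{label}: ~{seconds:.1f} s to generate 60 s of audio "
              f"(~{speedup:.1f}x faster than playback)")
```

Under those reported figures, a one-minute clip would take roughly 18 seconds to generate on the plain RTX 4090 setup and roughly 8 seconds with Nano-VLLM. Slower consumer cards would see higher RTF values, which the published numbers do not cover.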
