VoxCPM2 beats ElevenLabs

An open‑source voice AI model, VoxCPM2, reportedly outperformed ElevenLabs on benchmarks, cloning voices from seconds of audio in 30 languages and running locally for free. (x.com)

Voice cloning is software that turns a short recording into a synthetic voice, and the makers of a new open-source model called VoxCPM2 say it can now do that across 30 languages on consumer hardware. (github.com)

OpenBMB released VoxCPM2 on April 8, 2026 as a 2 billion-parameter text-to-speech model under an Apache 2.0 license, with code on GitHub and weights on Hugging Face. The project says it was trained on more than 2 million hours of multilingual speech data. (github.com, huggingface.co, pypi.org)

The model card says VoxCPM2 can clone a voice from a short reference clip, generate a new voice from a text description, and output 48 kilohertz audio. OpenBMB also says it can stream in real time at a real-time factor of about 0.3 on an Nvidia RTX 4090, or about 0.13 with Nano-VLLM acceleration. (huggingface.co, github.com)

The claim that VoxCPM2 “beats ElevenLabs” is narrower than it sounds. OpenBMB’s public benchmark table compares VoxCPM, the earlier version described in its September 29, 2025 arXiv paper, against other zero-shot text-to-speech systems on Seed-TTS-eval and CV3-eval, and the table shown publicly does not list ElevenLabs. (github.com, arxiv.org)

That matters because ElevenLabs sells a hosted product, not an open model checkpoint, and its cloning system is split into two modes. Its site says Instant Voice Cloning can work from a 10-second recording, while Professional Voice Cloning trains a dedicated model and takes about 3 hours for English or 6 hours for multilingual voices. (elevenlabs.io, elevenlabs.io)

ElevenLabs also prices those features by subscription tier. Its pricing page lists Instant Voice Cloning on the $5 Starter plan, Professional Voice Cloning on the $22 Creator plan, and 44.1 kilohertz pulse-code modulation audio output through the application programming interface on the $99 Pro plan. (elevenlabs.io)

VoxCPM2’s pitch is different: download the weights, install Python 3.10 or newer with PyTorch 2.5.0 or newer and CUDA 12.0 or newer, and run it yourself. The docs say the device is selected automatically from the graphics processor, Apple Metal, or the central processor, and the first run downloads the model weights. (huggingface.co, readthedocs.io)

The technical idea is to skip the usual “speech tokenizer,” which is a compression step that turns audio into discrete symbols before generation. OpenBMB says VoxCPM instead generates continuous speech representations directly through a diffusion-autoregressive architecture, and its 2025 paper argued that token-based systems trade away some expressiveness for stability. (github.com, arxiv.org)

OpenBMB says the new release supports Arabic, Chinese, English, French, German, Hindi, Japanese, Korean, Spanish, Vietnamese and 20 other languages, plus several Chinese dialects. ElevenLabs says its voice cloning can speak 32-plus languages, and its latest flagship text-to-speech model supports 74 languages. (huggingface.co, elevenlabs.io, elevenlabs.io)

The immediate shift is not that hosted voice platforms disappeared on April 12, 2026. It is that a free Apache-licensed model now offers multilingual cloning, local inference, and commercial use terms that put pressure on paid voice tools to compete on quality, speed, and safety rather than access alone. (huggingface.co, elevenlabs.io, github.com)
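
For readers who want to sanity-check the real-time factor figures quoted earlier, here is a minimal sketch of the arithmetic. It assumes the common convention that real-time factor equals compute time divided by the duration of the audio produced, so values below 1 mean faster than real time. OpenBMB's exact measurement method is not described here, and the function names are hypothetical, not part of any VoxCPM2 interface.

```python
# Rough arithmetic for real-time factor (RTF), under the assumed convention
# RTF = compute_seconds / audio_seconds. Function names are illustrative only.

def synthesis_seconds(audio_seconds: float, rtf: float) -> float:
    """Estimate compute time needed to generate `audio_seconds` of speech."""
    if audio_seconds < 0 or rtf <= 0:
        raise ValueError("audio_seconds must be >= 0 and rtf must be > 0")
    return audio_seconds * rtf


def speedup_over_real_time(rtf: float) -> float:
    """How many seconds of audio are produced per second of compute."""
    if rtf <= 0:
        raise ValueError("rtf must be > 0")
    return 1.0 / rtf


if __name__ == "__main__":
    # The two approximate figures reported for an Nvidia RTX 4090.
    reported = [("RTX 4090", 0.3), ("RTX 4090 + Nano-VLLM", 0.13)]
    for label, rtf in reported:
        seconds = synthesis_seconds(60.0, rtf)
        speedup = speedup_over_real_time(rtf)
        print(f"{label}: ~{seconds:.1f} s of compute per minute of speech "
              f"(~{speedup:.1f}x real time)")
```

Under these assumptions, a minute of speech would take roughly 18 seconds of compute at a factor of 0.3 and roughly 8 seconds at 0.13. Both figures are approximate, since the reported factors are themselves rounded.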
