VoiceBox: local voice cloning tool

VoiceBox is a fully open-source desktop app that clones voices from reference clips as short as about 3 seconds and runs entirely locally using Qwen3-TTS. It supports nuanced emotions and multiple languages, making it useful for small audio projects that need private or offline TTS without cloud dependencies. Because generation happens on-device, it avoids the routine upload of voice samples that raises data-exfiltration concerns with cloud voice services. (x.com)

Most voice cloning tools work like a photocopier in someone else’s office: you upload a sample, their servers do the work, and your voice leaves your computer. VoiceBox flips that setup by packaging voice cloning into a desktop app that runs on Mac, Windows, and Linux with local inference instead of a cloud account. (github.com, voicebox.sh)

Text to speech is software that turns written words into audio, and voice cloning is the extra step where the software tries to keep one specific person’s tone, accent, and rhythm. The hard part has usually been getting a usable clone from a tiny sample instead of a long, clean recording session. (github.com, qwen.ai)

The model under this app is Qwen3-TTS, an open-source speech system released by the Qwen team in January 2026. Qwen says its voice-cloning variant can copy a voice from about 3 seconds of reference audio and then speak in 10 major languages with that cloned voice. (alibabacloud.com, qwen.ai, github.com)

That 3-second number is what makes this notable. Older consumer tools often wanted longer samples, but Qwen3-TTS was built to work from a clip short enough to fit in a voice note, which makes quick tests and small edits much easier. (qwen.ai, github.com)

Qwen3-TTS also lets users steer delivery with plain-language prompts instead of audio engineering controls. Its documentation shows instructions like speaking happily or whispering softly, which means the same cloned voice can be pushed toward different emotions without retraining a model. (github.com, qwen.ai)

VoiceBox is the layer that turns that model into a desktop tool instead of a command-line project. Its GitHub page describes it as a local-first voice cloning studio with a timeline editor, multiple text-to-speech engines, post-processing effects, and a built-in application programming interface for developers who want to automate it. (github.com)

The privacy angle is simple: if the model weights, the audio files, and the generation all stay on your machine, there is no routine upload step to a vendor’s server. That makes a local app attractive for podcast drafts, game prototypes, internal demos, or any project where a voice sample is sensitive enough that a cloud workflow is a nonstarter. (voicebox.sh, github.com, scriptbyai.com)

There are still hardware limits, because “local” does not mean “runs well on every laptop.” VoiceBox advertises Apple Silicon support and local or remote inference options, while Qwen3-TTS ships model sizes up to 1.7 billion parameters, so speed and quality will depend heavily on the machine doing the work. (voicebox.sh, github.com)

This is why the app lands in a useful middle ground. It is not just a research model on GitHub, and it is not a subscription website that meters every minute of audio; it is an open-source desktop wrapper around a newly released cloning model that gives small creators offline control over voice generation. (github.com, voicebox.sh)
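
The hardware caveat can be made concrete with some back-of-the-envelope arithmetic. The sketch below estimates how much memory the weights of a 1.7-billion-parameter model would occupy at a few common numeric precisions. The precisions listed are illustrative assumptions, not a statement of what VoiceBox or Qwen3-TTS actually loads, and the figures cover weights only; activations, audio codecs, and the app itself need additional memory. The function name `weight_memory_gb` is made up for this example.

```python
# Rough estimate of weight storage for a model of a given parameter count.
# Assumption: every parameter is stored at the same precision, and
# 1 GB is taken as 10**9 bytes. Runtime overhead is NOT included.

# Bytes needed to store one parameter at common numeric precisions.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gb(num_params: int, dtype: str) -> float:
    """Return the approximate size of the weights in gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    params = 1_700_000_000  # largest Qwen3-TTS size mentioned in the article
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: {weight_memory_gb(params, dtype):.1f} GB")
```

At 16-bit precision this works out to roughly 3.4 GB for the weights alone, which suggests why a machine with little free memory may struggle even though the model is small by current standards.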

Shared from Scout - Be the smartest in the room.