Alibaba open-sources Qwen3-TTS with voice cloning from three seconds of audio

TL;DR — Alibaba’s Qwen team open-sourced Qwen3-TTS on January 22, 2026. The model family supports voice cloning from three seconds of reference audio, covers 10 languages, and streams synthesis with a 97 ms time to first byte. It ships under the Apache 2.0 license.

What Qwen3-TTS delivers

Qwen3-TTS comes in two sizes: a 1.7-billion-parameter model (Qwen3-TTS-12Hz-1.7B) and a 600-million-parameter variant (Qwen3-TTS-12Hz-0.6B). Both use a discrete multi-codebook language-model architecture, which differs from the more common LM+DiT pipeline used by competitors.
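The “12Hz” in the checkpoint names suggests the codec emits 12 audio frames per second, with one index per codebook per frame. A minimal sketch of the resulting token accounting; the codebook count below is a hypothetical placeholder, since the announcement does not state it:

```python
# Token accounting for a discrete multi-codebook codec (sketch).
# FRAME_RATE_HZ is read off the model names (Qwen3-TTS-12Hz-*);
# NUM_CODEBOOKS is an assumed placeholder, not a published figure.
FRAME_RATE_HZ = 12
NUM_CODEBOOKS = 4  # hypothetical

def codec_tokens(duration_s: float, codebooks: int = NUM_CODEBOOKS) -> int:
    """Total discrete tokens the LM must emit for `duration_s` of audio."""
    frames = round(duration_s * FRAME_RATE_HZ)
    return frames * codebooks

# A three-second reference clip is only 36 frames at 12 Hz, which is
# part of why such a short sample can suffice for zero-shot cloning.
print(codec_tokens(3.0))  # 36 frames -> 144 tokens with 4 codebooks
print(codec_tokens(1.0))  # 48 tokens per second of audio
```

The low frame rate is the design lever here: fewer frames per second means fewer autoregressive steps per second of output audio.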

The headline features are zero-shot voice cloning from three seconds of audio, natural language voice design (describing the desired voice in plain text), and controllable emotion and prosody. Supported languages include Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian.

Why it matters

With a 97 ms time to first byte, Qwen3-TTS competes on latency with commercial offerings from Cartesia (Sonic) and Deepgram (Aura). The Apache 2.0 license removes the licensing friction that limits adoption of many other open-source TTS models.
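To put the latency figure in context: at a 12 Hz codec rate, one frame spans about 83 ms of audio, so a 97 ms time to first byte is on the order of a single frame duration. A quick arithmetic check; note that reading the frame rate off the model names, and the comparison itself, are our inference, not claims from the announcement:

```python
# Back-of-envelope: audio covered by one codec frame at 12 Hz versus
# the quoted 97 ms time to first byte. The 12 Hz rate is inferred from
# the model names (Qwen3-TTS-12Hz-*); the framing is illustrative.
FRAME_RATE_HZ = 12
TTFB_MS = 97

frame_ms = 1000 / FRAME_RATE_HZ
print(round(frame_ms, 1))            # ms of audio per frame
print(round(TTFB_MS / frame_ms, 2))  # TTFB expressed in frame durations
```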

For developers building voice agents or multilingual audio products, Qwen3-TTS offers a self-hosted alternative that avoids per-character API costs entirely. The model is available on Hugging Face and integrates with vLLM for production serving.
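vLLM serves models behind an OpenAI-compatible HTTP interface, so a self-hosted deployment would typically be queried that way. Below is a minimal sketch of assembling such a request body; the route shape, field names, and the model ID are illustrative assumptions, not a documented Qwen3-TTS API schema:

```python
# Sketch of a request payload for a self-hosted, OpenAI-compatible TTS
# endpoint (e.g., one fronted by vLLM). Field names and the model ID
# are illustrative assumptions, not the documented Qwen3-TTS schema.
from typing import Optional

def build_tts_request(
    text: str,
    model: str = "Qwen/Qwen3-TTS-12Hz-1.7B",   # assumed Hub-style model ID
    voice_description: Optional[str] = None,    # natural-language voice design
    reference_audio_b64: Optional[str] = None,  # ~3 s clip for zero-shot cloning
) -> dict:
    """Assemble a JSON-serializable body for a hypothetical speech route."""
    payload = {"model": model, "input": text, "stream": True}
    if voice_description:
        payload["voice_description"] = voice_description
    if reference_audio_b64:
        payload["reference_audio"] = reference_audio_b64
    return payload

req = build_tts_request(
    "Hello from a self-hosted model.",
    voice_description="calm, low-pitched narrator",
)
print(sorted(req))
```

Because the model is self-hosted, the per-request cost is fixed by your own hardware rather than metered per character, which is the economic point made above.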

Source: Qwen AI Blog, January 22, 2026.