qwen/qwen3-tts

A unified text-to-speech demo featuring three modes: Voice, Clone, and Design

3.6K runs

Qwen3-TTS

Generate natural-sounding speech in 10 languages with voice cloning and voice design

About

Qwen3-TTS is a text-to-speech model from the Qwen team at Alibaba Cloud. It turns text into natural-sounding speech in 10 languages: Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian.

The model can clone a voice from just 3 seconds of audio, create entirely new voices from text descriptions, and handle multiple Chinese dialects. It’s trained on over 5 million hours of speech data and uses a dual-track architecture that keeps latency low—just 97 milliseconds for the first audio packet.

How it works

Qwen3-TTS uses a language model architecture paired with a custom speech tokenizer called Qwen3-TTS-Tokenizer-12Hz. This tokenizer compresses speech into efficient codes while preserving the details that make voices sound natural—like tone, emotion, and speaking style.
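
If the "12Hz" in the tokenizer's name refers to its code rate, the compression is easy to picture: one second of speech becomes about a dozen discrete codes. A rough back-of-envelope under that assumption:

```python
# Back-of-envelope only: assumes the "12Hz" in the tokenizer's name means
# 12 speech codes per second of audio.
codes_per_second = 12
clip_seconds = 3  # the minimum reference length used for voice cloning

print(codes_per_second * clip_seconds)  # -> 36 codes for a 3-second clip
```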

The model comes in two sizes: a 1.7 billion parameter version for the best quality, and a 600 million parameter version that’s faster while still producing good results.

What you can do with it

Clone voices: Give the model 3 seconds of audio and it can reproduce that voice speaking any text you provide. The cloning works across languages too—clone a voice in English and use it to speak Chinese.
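
A minimal sketch of voice cloning through the Replicate Python client. The input field names below (`mode`, `text`, `reference_audio`) are assumptions for illustration, not the model's confirmed schema; check the input schema on the model page for the real parameter names.

```python
import replicate

# Hypothetical cloning call: the field names are assumed, not taken from the model's schema.
output = replicate.run(
    "qwen/qwen3-tts",
    input={
        "mode": "clone",                                      # assumed mode selector
        "text": "Hello! This is my cloned voice reading new text.",
        "reference_audio": open("speaker_sample.wav", "rb"),  # roughly 3 seconds of clean speech
    },
)

# Output is typically a URL (or file-like object) pointing at the generated audio.
print(output)
```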

Design new voices: Describe the voice you want in plain language and the model creates it. You might ask for “a warm storyteller voice with gentle pacing” or “a deep male voice with a British accent.” The model interprets your description and generates matching speech.

Control speech style: Use natural language instructions to adjust how the speech sounds. You can control emotion, speaking speed, and tone. The model adapts its output based on the meaning of your text, placing pauses naturally and emphasizing the right words.
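
Voice design and style control are both driven by plain-language prompts, so a request might look like the sketch below. Again, the parameter names (`voice_description`, `instruction`) are hypothetical; only the model identifier comes from this page.

```python
import replicate

# Hypothetical voice-design call: parameter names are illustrative assumptions.
output = replicate.run(
    "qwen/qwen3-tts",
    input={
        "mode": "design",                                   # assumed mode selector
        "text": "Once upon a time, in a quiet seaside town...",
        "voice_description": "a warm storyteller voice with gentle pacing",
        "instruction": "speak slowly, with a calm and soothing tone",
    },
)

print(output)  # typically a URL to the generated audio
```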

Handle dialects: Beyond standard languages, the model supports Chinese dialects including Sichuan, Beijing, and others, capturing regional speech patterns accurately.

Use cases

This model works well for:

  • Creating audiobooks with consistent narrator voices
  • Building conversational AI that needs low-latency responses
  • Localizing content across multiple languages
  • Making accessibility tools with familiar voice patterns
  • Prototyping voice interfaces quickly

Technical details

On multilingual benchmarks, the model reports lower word error rates across 10 languages than comparable systems such as GPT-4o Audio Preview and ElevenLabs.

The dual-track streaming architecture lets the model start outputting audio almost immediately while generating the rest. This makes it useful for real-time applications like live translation or interactive voice responses.

The model handles noisy input text well and maintains stability even when generating long-form speech—it can produce over 10 minutes of continuous audio without degrading quality or introducing artifacts.

Limitations

The 1.7 billion parameter model handles background noise better than the 600 million parameter version. For best results with voice cloning, use clean audio recordings without background sounds.

The model works best when you provide clear, well-formed text. While it’s robust to some input errors, giving it clean text produces better results.

Try it yourself

You can try this model in the Replicate Playground at replicate.com/playground.
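
If you prefer the API to the Playground, the model can be called with the Replicate Python client (after setting the `REPLICATE_API_TOKEN` environment variable). The preset-voice parameters shown here are assumptions; only the model identifier is taken from this page.

```python
import replicate

# Simplest case: plain text-to-speech with a built-in voice.
# "voice" and "language" are assumed parameter names, not the confirmed schema.
output = replicate.run(
    "qwen/qwen3-tts",
    input={
        "text": "Qwen3-TTS generates natural-sounding speech in ten languages.",
        "voice": "default",
        "language": "en",
    },
)

# Depending on the client version, the output is a URL string or a file-like object.
print(output)
```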
