
Chatterbox Multilingual TTS 🌍🎙️

Overview 🗣️

Chatterbox Multilingual TTS is a text-to-speech model that generates natural, expressive speech in 23 languages. It builds on the work of Resemble AI and their multilingual TTS research: this is their Chatterbox Multilingual model, wrapped to run on Replicate. You can clone a voice and generate speech in any of the 23 supported languages, including cross-language voice transfer.

Support Resemble AI and learn more about their work:
- Resemble AI Website
- Hugging Face Demo
- Research Paper

Supported Languages 🌐

The model supports 23 languages with native voice cloning and cross-language transfer:

Arabic (ar) • Chinese (zh) • Danish (da) • Dutch (nl) • English (en) • Finnish (fi) • French (fr) • German (de) • Greek (el) • Hebrew (he) • Hindi (hi) • Italian (it) • Japanese (ja) • Korean (ko) • Malay (ms) • Norwegian (no) • Polish (pl) • Portuguese (pt) • Russian (ru) • Spanish (es) • Swahili (sw) • Swedish (sv) • Turkish (tr)
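The codes in parentheses are the values passed as `language_id`. A small lookup table, sketched here as an assumption about how a caller might check input before sending a request, can catch unsupported languages early. The helper name `normalize_language_id` is illustrative and not part of the model.

```python
# Language codes copied from the list above; the helper name is illustrative.
SUPPORTED_LANGUAGES = {
    "ar": "Arabic", "zh": "Chinese", "da": "Danish", "nl": "Dutch",
    "en": "English", "fi": "Finnish", "fr": "French", "de": "German",
    "el": "Greek", "he": "Hebrew", "hi": "Hindi", "it": "Italian",
    "ja": "Japanese", "ko": "Korean", "ms": "Malay", "no": "Norwegian",
    "pl": "Polish", "pt": "Portuguese", "ru": "Russian", "es": "Spanish",
    "sw": "Swahili", "sv": "Swedish", "tr": "Turkish",
}


def normalize_language_id(value: str) -> str:
    """Return a supported language code, accepting a code or an English name."""
    candidate = value.strip().lower()
    if candidate in SUPPORTED_LANGUAGES:
        return candidate
    for code, name in SUPPORTED_LANGUAGES.items():
        if name.lower() == candidate:
            return code
    raise ValueError(f"Unsupported language: {value!r}")
```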

Cross-Language Voice Transfer ✨

One of the most powerful features is cross-language voice transfer: clone a voice in one language and have it speak fluently in any of the other 22 supported languages while preserving the speaker’s unique characteristics.

For example:
- Clone an English speaker’s voice → Generate French speech in their voice
- Use a Spanish reference → Create natural Japanese speech
- Take a Hindi speaker → Produce German audio with their vocal characteristics

Getting Started 🚀

Basic Text-to-Speech

Simply provide text and select a language - the model will use a high-quality default voice:

text_to_synthesize: "Bonjour, comment allez-vous aujourd'hui?"
language_id: "fr"
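A minimal sketch of sending these inputs through the Replicate Python client. The input names `text_to_synthesize` and `language_id` come from the example above. `model_ref` is a placeholder, since this README does not state the model identifier, and the helper names are illustrative. It assumes the `replicate` package is installed and `REPLICATE_API_TOKEN` is set.

```python
def basic_tts_input(text: str, language_id: str) -> dict:
    """Build the input payload for default-voice synthesis."""
    if not text.strip():
        raise ValueError("text_to_synthesize must not be empty")
    return {"text_to_synthesize": text, "language_id": language_id}


def synthesize(model_ref: str, text: str, language_id: str):
    """Run the model through the Replicate Python client.

    model_ref is a placeholder: copy the real "owner/name" identifier
    (optionally with ":version") from the model page.
    """
    import replicate  # imported lazily so the payload helper works without it

    return replicate.run(model_ref, input=basic_tts_input(text, language_id))
```

The shape of the returned value depends on the client version, so inspect it before assuming a URL or a file object.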

Voice Cloning with Reference Audio

Upload a short audio clip (even 3-10 seconds works!) to clone that speaker’s voice:

text_to_synthesize: "Hello, this is a test of voice cloning technology."
language_id: "en"  
reference_audio: [upload your audio file]

Cross-Language Voice Cloning

Clone a voice from one language and use it in another:

text_to_synthesize: "Guten Tag, wie geht es Ihnen?"
language_id: "de"
reference_audio: [English speaker audio file]
cfg_weight: 0.0  # Set to 0 for best cross-language transfer

Parameter Controls 🎛️

exaggeration (0.25-2.0): Controls speech expressiveness. Neutral = 0.5, higher values add more emotion and emphasis. Extreme values can be unstable.

temperature (0.05-5.0): Controls randomness in generation. Lower = more consistent, higher = more varied pronunciation and prosody.

cfg_weight (0.2-1.0, or 0.0 for cross-language transfer): CFG/pace weight that controls generation guidance. Use 0.5 for normal speech; set it to 0.0 for cross-language voice transfer to reduce accent bleed.

seed (integer): Random seed for reproducible results. Set to 0 for random generation.
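The ranges above can be checked on the client side before a request is sent. This is a sketch under the assumption that out-of-range values should be reported rather than silently clamped. The function name is hypothetical, and `cfg_weight` treats 0.0 as a special case because this README recommends that value for cross-language transfer.

```python
# Ranges as documented above. cfg_weight additionally accepts 0.0, the value
# this README recommends for cross-language transfer.
RANGES = {
    "exaggeration": (0.25, 2.0),
    "temperature": (0.05, 5.0),
    "cfg_weight": (0.2, 1.0),
}


def validate_settings(settings: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for name, value in settings.items():
        if name == "seed":
            # bool is a subclass of int in Python, so reject it explicitly.
            if not isinstance(value, int) or isinstance(value, bool):
                problems.append("seed must be an integer")
            continue
        if name not in RANGES:
            problems.append(f"unknown parameter: {name}")
            continue
        low, high = RANGES[name]
        if name == "cfg_weight" and value == 0.0:
            continue  # special case for cross-language transfer
        if not low <= value <= high:
            problems.append(f"{name}={value} is outside {low}-{high}")
    return problems
```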

Pro Tips for Best Results 💡

For Cross-Language Voice Transfer:
- Set cfg_weight to 0.0 to minimize accent from the reference language
- Use clear, high-quality reference audio (3-10 seconds is sufficient)
- The reference audio language doesn’t need to match your target language

For Expressive Speech:
- Try a lower cfg_weight (~0.3) with higher exaggeration (0.7+) for dramatic speech
- Higher exaggeration speeds up speech; lower cfg_weight to compensate with slower pacing

General Quality Tips:
- Use reference audio with minimal background noise
- Shorter reference clips often work better than longer ones
- Default settings (exaggeration=0.5, cfg_weight=0.5) work well for most use cases
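The tips above can be captured as named presets. This is only an illustrative sketch: the preset names and the `apply_preset` helper are assumptions, and the values are the ones suggested in this section. Anything the caller sets explicitly takes precedence over the preset.

```python
# Values taken from the tips above; preset names are illustrative.
PRESETS = {
    "default": {"exaggeration": 0.5, "cfg_weight": 0.5},
    "dramatic": {"exaggeration": 0.7, "cfg_weight": 0.3},
    "cross_language": {"cfg_weight": 0.0},
}


def apply_preset(payload: dict, preset: str) -> dict:
    """Return a new dict with preset values filled in under the payload.

    Keys already present in payload override the preset, and the original
    payload is left unmodified.
    """
    if preset not in PRESETS:
        raise KeyError(f"Unknown preset: {preset!r}")
    return {**PRESETS[preset], **payload}
```

For instance, `apply_preset({"text_to_synthesize": "Guten Tag", "language_id": "de"}, "cross_language")` adds `cfg_weight: 0.0` while leaving the other fields untouched.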

Audio Quality Examples 🎵

The model produces broadcast-quality audio suitable for:
- Multilingual Content: Podcasts, videos, audiobooks in multiple languages
- Voice Dubbing: Convert content between languages with consistent speakers
- Character Voices: Games and animations with multilingual character consistency
- Accessibility: Text-to-speech for multilingual applications
- E-learning: Educational content with native-sounding pronunciation

Technical Details 🔬

- Architecture: Based on advanced transformer architecture optimized for multilingual synthesis
- Training Data: Trained on high-quality multilingual speech data
- Sample Rate: 24 kHz output for professional audio quality
- Latency: Fast inference suitable for real-time applications
- Watermarking: Built-in neural watermarking for responsible AI usage

Limitations to Consider ⚠️

- Very short reference audio (under 2 seconds) may not capture full voice characteristics
- Cross-language transfer works best with cfg_weight set to 0.0
- Some language combinations may have slight accent bleed without proper cfg_weight tuning
- Extremely high exaggeration values (>1.5) can introduce artifacts
- Processing time scales with text length (max 300 characters per request)
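Because each request is capped at 300 characters, longer text has to be split and synthesized in pieces. The sketch below is one possible approach, not something the model provides: it packs whole sentences into chunks and hard-splits any single sentence that exceeds the limit, which may cut mid-word. Sentence detection is a simple assumption (`.`, `!` or `?` followed by whitespace), so text in scripts that do not use spaces falls back to hard splits.

```python
import re

MAX_CHARS = 300  # per-request limit stated above


def split_text(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Split text into chunks of at most max_chars, preferring sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks = []
    current = ""
    for sentence in sentences:
        # Hard-split any single sentence that is itself too long.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent as its own request with the same settings and a fixed seed, and the resulting audio segments concatenated in order.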

Terms of Use 📚

The use of this voice synthesis technology for the following purposes is prohibited:

- Creating audio content to deceive, defraud, or mislead others about the identity of the speaker.
- Generating speech that violates someone’s rights without their explicit consent.
- Creating content for harassment, threats, or intimidation.
- Producing audio that spreads misinformation or fake news.
- Impersonating public figures, officials, or other individuals for malicious purposes.
- Any use that violates applicable laws or regulations regarding synthetic media and deepfakes.

Disclaimer ‼️

I am not liable for any direct, indirect, consequential, incidental, or special damages arising out of or in any way connected with the use/misuse or inability to use this software. Users are responsible for ensuring their use complies with all applicable laws and respects others’ rights.

Model created 5 months, 3 weeks ago
