Official

resemble-ai / chatterbox

Generate expressive, natural speech. Features unique emotion control, instant voice cloning from short audio, and built-in watermarking.

  • Public
  • 15.4K runs

Input

  • Text to synthesize (string, required)
  • Path to the reference audio file (file, optional)
  • Exaggeration (number, minimum: 0.25, maximum: 2, default: 0.5). Neutral = 0.5; extreme values can be unstable.
  • CFG/Pace weight (number, minimum: 0.2, maximum: 1, default: 0.5)
  • Temperature (number, minimum: 0.05, maximum: 5, default: 0.8)
  • Seed (integer, default: 0). Use 0 for a random seed.
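
The ranges and defaults listed above can be checked on the client before a request is sent. The following is a minimal sketch, not the model's actual schema: the key names (`prompt`, `audio_prompt`, `exaggeration`, `cfg_weight`, `temperature`, `seed`) are assumptions for illustration, since this page shows only the field descriptions. Check the model's API schema for the real names.

```python
from typing import Any, Dict, Optional

# (minimum, maximum, default) as listed in the Input section.
# The key names are hypothetical; only the descriptions appear on this page.
NUMERIC_FIELDS = {
    "exaggeration": (0.25, 2.0, 0.5),
    "cfg_weight": (0.2, 1.0, 0.5),
    "temperature": (0.05, 5.0, 0.8),
}


def build_input(
    prompt: str,
    audio_prompt: Optional[str] = None,
    seed: int = 0,
    **overrides: float,
) -> Dict[str, Any]:
    """Build an input dictionary, rejecting values outside the listed ranges."""
    if not prompt:
        raise ValueError("prompt (text to synthesize) is required")
    if not isinstance(seed, int):
        raise ValueError("seed must be an integer (0 for random)")

    unknown = set(overrides) - set(NUMERIC_FIELDS)
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")

    payload: Dict[str, Any] = {"prompt": prompt, "seed": seed}
    if audio_prompt is not None:
        # The reference audio is optional, so it is only included when given.
        payload["audio_prompt"] = audio_prompt

    for name, (low, high, default) in NUMERIC_FIELDS.items():
        value = overrides.get(name, default)
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside [{low}, {high}]")
        payload[name] = value
    return payload
```

Calling `build_input("Hello there", exaggeration=0.7)` returns a dictionary with the listed defaults for every field that was not overridden, while a value such as `temperature=6` raises `ValueError` before any request is made.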

Output

[Media player for the generated output]

Pricing

Model pricing for resemble-ai/chatterbox. Looking for volume pricing? Get in touch.

$0.025 per thousand input characters, or 40,000 characters for $1
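
To make the arithmetic concrete, here is a small sketch that estimates the cost of a prompt from the listed rate. It assumes every character of the input, including whitespace, counts toward the total and that no rounding or minimum charge applies; neither detail is stated on this page. The function name `estimate_cost_usd` is made up for this example.

```python
from decimal import Decimal

# Listed rate: $0.025 per thousand input characters.
PRICE_PER_THOUSAND_CHARS = Decimal("0.025")


def estimate_cost_usd(text: str) -> Decimal:
    """Estimate the cost in USD of synthesizing `text` at the listed rate."""
    # Assumption: all characters are billed, with no rounding or minimum charge.
    return PRICE_PER_THOUSAND_CHARS * len(text) / 1000
```

Under these assumptions a 40,000-character input comes to exactly $1, matching the figure above.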

Official models are always on, maintained, and have predictable pricing. Learn more.

Check out our docs for more information about how pricing works on Replicate.

Readme

Chatterbox TTS

We’re excited to introduce Chatterbox, Resemble AI’s first production-grade open source TTS model. Licensed under MIT, Chatterbox has been benchmarked against leading closed-source systems like ElevenLabs, and is consistently preferred in side-by-side evaluations.

Whether you’re working on memes, videos, games, or AI agents, Chatterbox brings your content to life. It’s also the first open source TTS model to support emotion exaggeration control, a powerful feature that makes your voices stand out. Try it now on our Hugging Face Gradio app.

If you like the model but need to scale or tune it for higher accuracy, check out our competitively priced TTS service. It delivers reliable performance with latency under 200 ms, which makes it well suited to production use in agents, applications, or interactive media.

Key Details

  • SoTA zero-shot TTS
  • 0.5B Llama backbone
  • Unique exaggeration/intensity control
  • Ultra-stable with alignment-informed inference
  • Trained on 0.5M hours of cleaned data
  • Watermarked outputs
  • Easy voice conversion script
  • Outperforms ElevenLabs

Tips

  • General Use (TTS and Voice Agents):
      • The default settings (exaggeration=0.5, cfg=0.5) work well for most prompts.
      • If the reference speaker has a fast speaking style, lowering cfg to around 0.3 can improve pacing.
  • Expressive or Dramatic Speech:
      • Try lower cfg values (e.g. ~0.3) and increase exaggeration to around 0.7 or higher.
      • Higher exaggeration tends to speed up speech; reducing cfg helps compensate with slower, more deliberate pacing.
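
The tips above can be captured as named presets layered over the defaults. This is only a sketch: the preset names and the `settings_for` helper are invented for illustration, and the keys `exaggeration` and `cfg` simply mirror the wording of the tips, not a confirmed API schema.

```python
from typing import Dict

# Default settings quoted in the tips.
DEFAULTS: Dict[str, float] = {"exaggeration": 0.5, "cfg": 0.5}

# Hypothetical preset names; the values come from the tips.
PRESETS: Dict[str, Dict[str, float]] = {
    "general": {},                                    # defaults work for most prompts
    "fast_reference_speaker": {"cfg": 0.3},           # lower cfg to improve pacing
    "expressive": {"exaggeration": 0.7, "cfg": 0.3},  # more drama, slower pacing
}


def settings_for(preset: str) -> Dict[str, float]:
    """Return the defaults with the chosen preset's overrides applied."""
    if preset not in PRESETS:
        raise KeyError(f"unknown preset: {preset!r}")
    return {**DEFAULTS, **PRESETS[preset]}
```

The values are starting points. The tips give 0.7 as a lower bound for expressive speech ("around 0.7 or higher"), so the `expressive` preset may need a higher exaggeration for a given voice.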

Acknowledgements

Built-in PerTh Watermarking for Responsible AI

Every audio file generated by Chatterbox includes a watermark from Resemble AI’s PerTh (Perceptual Threshold) Watermarker. These imperceptible neural watermarks survive MP3 compression, audio editing, and common manipulations while maintaining nearly 100% detection accuracy.

Disclaimer

Don’t use this model to do bad things. Prompts are sourced from freely available data on the internet.