Generate speech
Generate natural-sounding speech from text with these powerful models. Clone your own voice or pick from a variety of languages and speaking styles.
Our Pick: xtts-v2
For most text-to-speech needs, we recommend xtts-v2. It produces high-quality, realistic speech output and supports cloning voices from an audio sample.
A key advantage of xtts-v2 is its language support. It can generate speech in 12 languages, including English, Spanish, French, German, and Italian, so it’s a great choice if you need multilingual output.
xtts-v2 is fast and reasonably priced. Expect to pay around $0.007 for a typical paragraph of text. The main limitation is the lack of customization options. But for basic text-to-speech in various languages, it’s hard to beat.
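As a rough illustration of the points above, here is a minimal sketch of preparing a request for xtts-v2 and estimating spend from the quoted $0.007 per paragraph. The input field names ("text", "language", "speaker"), the language codes, and the helper functions are assumptions made for this example, not the model's confirmed schema, so check the model page before relying on them. Only the five languages named above are mapped.

```python
# Sketch only: field names and language codes below are assumptions,
# not taken from the model's published input schema.
ASSUMED_LANGUAGE_CODES = {
    "English": "en",
    "Spanish": "es",
    "French": "fr",
    "German": "de",
    "Italian": "it",
}

# Approximate price per typical paragraph, as quoted in the text above.
APPROX_COST_PER_PARAGRAPH_USD = 0.007


def build_xtts_input(text: str, language: str, speaker_url: str) -> dict:
    """Build a hypothetical input payload for xtts-v2."""
    if not text.strip():
        raise ValueError("text must not be empty")
    if language not in ASSUMED_LANGUAGE_CODES:
        raise ValueError(f"language not covered by this sketch: {language}")
    return {
        "text": text,
        "language": ASSUMED_LANGUAGE_CODES[language],
        "speaker": speaker_url,  # audio sample of the voice to clone
    }


def estimate_cost_usd(paragraphs: int) -> float:
    """Rough cost estimate; real pricing depends on run time."""
    if paragraphs < 0:
        raise ValueError("paragraphs must be non-negative")
    return paragraphs * APPROX_COST_PER_PARAGRAPH_USD


def generate_speech(model_ref: str, payload: dict):
    """Send the payload to Replicate.

    Needs the `replicate` package and a REPLICATE_API_TOKEN in the
    environment. `model_ref` may need a version suffix ("owner/name:version").
    """
    import replicate  # imported lazily so the helpers above work offline

    return replicate.run(model_ref, input=payload)
```

A call might look like generate_speech("lucataco/xtts-v2", build_xtts_input("Bonjour", "French", "https://example.com/voice.wav")), where the sample URL is a placeholder.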
Most Customizable: styletts2
If you want more control over the style and emotion of the generated speech, check out styletts2. It matches xtts-v2 in natural-sounding output and voice cloning, and it adds levers for fine-tuning the result.
With styletts2, you can adjust parameters like alpha and beta to control the timbre and prosody based on the reference speech. An embedding scale setting lets you dial up or down the emotional intensity. These options give you more power to sculpt the synthesized speech to your needs.
styletts2 is a hair slower than xtts-v2 but a bit cheaper per run. The main downside is that it only supports English. But if you’re working in English and want maximum customization, it’s the way to go.
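The alpha, beta, and embedding scale controls described above could be collected into a request like the one sketched below. The field names ("alpha", "beta", "embedding_scale", "reference"), the 0 to 1 range for alpha and beta, and the default values are assumptions for illustration, so confirm them against the model's input schema.

```python
from typing import Optional


def clamp(value: float, low: float, high: float) -> float:
    """Keep value inside the closed interval [low, high]."""
    return max(low, min(high, value))


def build_styletts2_input(
    text: str,
    reference_url: Optional[str] = None,
    alpha: float = 0.3,            # assumed default; timbre control
    beta: float = 0.7,             # assumed default; prosody control
    embedding_scale: float = 1.0,  # assumed default; emotional intensity
) -> dict:
    """Build a hypothetical input payload for styletts2."""
    if not text.strip():
        raise ValueError("text must not be empty")
    if embedding_scale <= 0:
        raise ValueError("embedding_scale must be positive")
    payload = {
        "text": text,
        # Assumption: alpha and beta are only meaningful between 0 and 1.
        "alpha": clamp(alpha, 0.0, 1.0),
        "beta": clamp(beta, 0.0, 1.0),
        "embedding_scale": embedding_scale,
    }
    if reference_url is not None:
        payload["reference"] = reference_url  # reference speech sample
    return payload
```

Clamping out-of-range values rather than rejecting them is a design choice for this sketch; a stricter caller might prefer to raise an error instead.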
Best for Expressive Speech: Bark
Looking to generate dynamic speech with lots of variation and personality? Bark has you covered. Its specialty is expressive speech synthesis with a wide range of voices and styles.
Bark shines for creative use cases like generating realistic dialogue, characters, and even sound effects. With over 100 voices spanning different languages, genders, and tones, it offers unmatched diversity. You can also clone your own voice for even more options.
The tradeoff is that Bark is slower and pricier than xtts-v2 or styletts2. It’s also trickier to control the output for a consistent voice. But when you need the most natural and expressive speech possible, Bark is in a league of its own.
For Singing Voice Conversion: RVC
RVC is a unique offering purpose-built for “singing voice conversion”. It lets you take an existing song and modify the vocals to sound like a different singer.
RVC is not suited to standard text-to-speech, but it is impressive at its specialized use case. It comes with a variety of built-in voices to choose from (Squidward, Trump, Drake, etc.). You can tweak settings like pitch, volume, and reverb to dial in the effect.
RVC won’t be the right tool for everyone. But if you want to create convincing song covers or mashups, it’s a powerful option to have in your toolkit.
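The recommendations above boil down to a short decision rule, sketched here as a helper function. The function name, its parameters, and the order in which the checks run are assumptions made for this illustration; the model names are the ones listed on this page.

```python
def pick_speech_model(
    singing_conversion: bool = False,
    expressive: bool = False,
    style_control: bool = False,
    language: str = "English",
) -> str:
    """Return a model name following the guide's recommendations."""
    if singing_conversion:
        # RVC is purpose-built for converting vocals in existing songs.
        return "zsxkib/realistic-voice-cloning"
    if expressive:
        # Bark: expressive, varied speech; slower and pricier.
        return "suno-ai/bark"
    if style_control and language == "English":
        # styletts2: most customizable, English only.
        return "adirik/styletts2"
    # Default pick: multilingual, fast, reasonably priced.
    return "lucataco/xtts-v2"
```

For example, pick_speech_model(style_control=True, language="French") falls back to xtts-v2, because styletts2 only supports English.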
Recommended models
lucataco / xtts-v2
Coqui XTTS-v2: Multilingual Text To Speech Voice Cloning
zsxkib / realistic-voice-cloning
Create song covers with any RVC v2 trained AI voice from audio files.
suno-ai / bark
🔊 Text-Prompted Generative Audio Model
afiaka87 / tortoise-tts
Generate speech from text, clone voices from mp3 files. From James Betker AKA "neonbjb".
adirik / styletts2
Generates speech from text
awerks / neon-tts
NeonAI Coqui AI TTS Plugin.
cjwbw / seamless_communication
SeamlessM4T—Massively Multilingual & Multimodal Machine Translation
declare-lab / tango
Tango 2: Use text prompts to make sound effects
camenduru / metavoice
MetaVoice-1B: 1.2B parameter base model trained on 100K hours of speech