Generate speech
Generate natural-sounding speech from text with these powerful models. Clone your own voice or pick from a variety of languages and speaking styles.
Our Pick: xtts-v2
For most text-to-speech needs, we recommend xtts-v2. It produces high-quality, realistic speech output and supports cloning voices from an audio sample.
A key advantage of xtts-v2 is its language support. It can generate speech in 12 languages, including English, Spanish, French, German, and Italian, so it’s a great choice if you need multilingual output.
xtts-v2 is fast and reasonably priced: expect to pay around $0.007 for a typical paragraph of text. Its main limitation is the lack of customization options, but for basic text-to-speech in various languages, it’s hard to beat.
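As a rough illustration of what a voice-cloning request and a cost estimate might look like, here is a minimal Python sketch. It assumes the model is hosted on Replicate and called through the `replicate` client's `replicate.run` function. The model reference `owner/xtts-v2` is a placeholder, and the input field names (`text`, `speaker`, `language`) are assumptions that should be checked against the model's own page. The helper names are hypothetical.

```python
import os

# Placeholder reference: replace with the real "owner/name" (a ":version"
# suffix may also be required). Not taken from the text above.
MODEL_REF = "owner/xtts-v2"


def build_xtts_input(text: str, speaker_url: str, language: str = "en") -> dict:
    """Build a request payload. Field names are assumed, not confirmed."""
    if not text.strip():
        raise ValueError("text must not be empty")
    return {"text": text, "speaker": speaker_url, "language": language}


def estimate_cost(paragraphs: int, price_per_paragraph: float = 0.007) -> float:
    """Rough estimate based on the ~$0.007-per-paragraph figure quoted above."""
    return round(paragraphs * price_per_paragraph, 4)


# Only attempt a real request when an API token is configured.
if __name__ == "__main__" and os.environ.get("REPLICATE_API_TOKEN"):
    import replicate  # pip install replicate

    payload = build_xtts_input(
        "Hello from a cloned voice.",
        "https://example.com/my-voice-sample.wav",  # placeholder sample URL
    )
    print(replicate.run(MODEL_REF, input=payload))
```

At the quoted price, `estimate_cost(100)` puts a 100-paragraph document at roughly $0.70, though actual billing depends on run time rather than paragraph count.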
Most Customizable: styletts2
If you want more control over the style and emotion of the generated speech, check out styletts2. It matches xtts-v2 in natural-sounding output and voice cloning. But it provides additional levers to fine-tune the result.
With styletts2, you can adjust parameters like alpha and beta to control how closely the timbre and prosody follow the reference speech. An embedding scale setting lets you dial the emotional intensity up or down. These options give you more power to sculpt the synthesized speech to your needs.
styletts2 is a hair slower than xtts-v2 but a bit cheaper per run. The main downside is that it only supports English. But if you’re working in English and want maximum customization, it’s the way to go.
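To make the styletts2 controls concrete, the sketch below groups them into one settings object and validates them before building a request payload. The payload field names, the default values, and the assumed valid ranges (alpha and beta between 0 and 1, a positive embedding scale) are illustrative assumptions rather than the model's documented interface. `StyleSettings` and `build_styletts2_input` are hypothetical names.

```python
from dataclasses import dataclass, asdict


@dataclass
class StyleSettings:
    """Hypothetical container for the styletts2 knobs described above."""
    alpha: float = 0.3            # assumed default; timbre-related setting
    beta: float = 0.7             # assumed default; prosody-related setting
    embedding_scale: float = 1.0  # assumed default; higher = more emotional intensity

    def validate(self) -> None:
        # Assumed ranges; check the model page for the real limits.
        for name in ("alpha", "beta"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {value}")
        if self.embedding_scale <= 0:
            raise ValueError("embedding_scale must be positive")


def build_styletts2_input(text: str, reference_url: str,
                          settings: StyleSettings) -> dict:
    """Validate the settings and merge them into a request payload."""
    settings.validate()
    return {"text": text, "reference": reference_url, **asdict(settings)}
```

For example, `StyleSettings(embedding_scale=2.0)` would request a more emotionally intense reading while leaving the timbre and prosody settings at their assumed defaults.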
Best for Expressive Speech: Bark
Looking to generate dynamic speech with lots of variation and personality? Bark has you covered. Its specialty is expressive speech synthesis with a wide range of voices and styles.
Bark shines for creative use cases like generating realistic dialogue, characters, and even sound effects. With over 100 voices spanning different languages, genders, and tones, it offers unmatched diversity. You can also clone your own voice for even more options.
The tradeoff is that Bark is slower and pricier than xtts-v2 or styletts2. It’s also harder to keep the voice consistent from one output to the next. But when you need the most natural and expressive speech possible, Bark is in a league of its own.
For Singing Voice Conversion: RVC
RVC is a unique offering purpose-built for “singing voice conversion”. It lets you take an existing song and modify the vocals to sound like a different singer.
While not suited for standard text-to-speech, RVC is impressive for its specialized use case. It comes with a variety of built-in voices to choose from (Squidward, Trump, Drake, etc.). You can tweak settings like pitch, volume, reverb, and more to dial in the effect.
RVC won’t be the right tool for everyone. But if you want to create convincing song covers or mashups, it’s a powerful option to have in your toolkit.
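The recommendations above can be summarized as a small decision helper. This is only a sketch of the guidance in this guide, not an official selection tool. The function name and its flags are hypothetical, and the order of the checks is one reasonable reading of the priorities described.

```python
def pick_model(language: str = "en", *, singing: bool = False,
               expressive: bool = False, customizable: bool = False) -> str:
    """Return the model this guide suggests for a given set of needs."""
    if singing:
        # RVC converts existing vocals; it is not a text-to-speech model.
        return "RVC"
    if expressive:
        # Bark trades speed and cost for variation and personality.
        return "Bark"
    if customizable and language == "en":
        # styletts2 exposes style controls but only supports English.
        return "styletts2"
    # Default pick: multilingual and supports voice cloning.
    return "xtts-v2"


print(pick_model("en", customizable=True))
print(pick_model("fr", customizable=True))
```

The second call falls back to xtts-v2 because styletts2 only supports English, which is the tradeoff noted in its section.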
Recommended models
batouresearch / spanish-f5-tts
An F5-TTS model fine-tuned for Spanish
x-lance / f5-tts
F5-TTS, the new state-of-the-art in open source voice cloning
platform-kit / mars5-tts
A novel speech model for insane prosody.
chenxwh / openvoice
Updated to OpenVoice v2: Versatile Instant Voice Cloning
declare-lab / tango
Tango 2: Use text prompts to make sound effects
cjwbw / voicecraft
Zero-Shot Speech Editing and Text-to-Speech in the Wild
cjwbw / parler-tts
Lightweight text-to-speech (TTS) model trained on 10.5K hours of audio data
camenduru / metavoice
MetaVoice-1B: 1.2B-parameter base model trained on 100K hours of speech
lucataco / pheme
Pheme generates a variety of conversational voices in 16 kHz for phone-call applications
zsxkib / realistic-voice-cloning
Create song covers with any RVC v2 trained AI voice from audio files.
cjwbw / seamless_communication
SeamlessM4T—Massively Multilingual & Multimodal Machine Translation
awerks / neon-tts
NeonAI Coqui AI TTS Plugin.
afiaka87 / tortoise-tts
Generate speech from text, clone voices from mp3 files. From James Betker AKA "neonbjb".