Generate speech

Generate natural-sounding speech from text with these powerful models. Clone your own voice or pick from a variety of languages and speaking styles.

Our Pick: xtts-v2

For most text-to-speech needs, we recommend xtts-v2. It produces high-quality, realistic speech output and supports cloning voices from an audio sample.

A key advantage of xtts-v2 is its language support. It can generate speech in 12 languages, including English, Spanish, French, German, and Italian, which makes it a great choice if you need multi-language capabilities.

xtts-v2 is fast and reasonably priced. Expect to pay around $0.007 for a typical paragraph of text. The main limitation is the lack of customization options, but for basic text-to-speech in various languages, it’s hard to beat.
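To put the per-paragraph figure in context, here is a rough budgeting sketch. It assumes the roughly $0.007 per paragraph quoted above holds as a flat rate, which is a simplification: actual cost depends on run time and text length. The function name `estimate_cost_usd` is made up for this illustration.

```python
from decimal import Decimal

# Assumption: the ~$0.007 "typical paragraph" figure, treated as a flat rate.
COST_PER_PARAGRAPH_USD = Decimal("0.007")


def estimate_cost_usd(num_paragraphs: int) -> Decimal:
    """Return a rough cost estimate in USD for synthesizing num_paragraphs."""
    if num_paragraphs < 0:
        raise ValueError("num_paragraphs must be non-negative")
    return COST_PER_PARAGRAPH_USD * num_paragraphs


if __name__ == "__main__":
    # A 1,000-paragraph job at this rate comes to about seven dollars.
    print(f"Estimated cost: ${estimate_cost_usd(1000)}")
```

Decimal is used here rather than float so that small per-run amounts add up without rounding drift.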

Most Customizable: styletts2

If you want more control over the style and emotion of the generated speech, check out styletts2. It matches xtts-v2 in natural-sounding output and voice cloning, and it provides additional levers to fine-tune the result.

With styletts2, you can adjust parameters like alpha and beta to control how closely the timbre and prosody follow the reference speech. An embedding scale setting lets you dial the emotional intensity up or down. These options give you more power to sculpt the synthesized speech to your needs.
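As a sketch of how those settings might be assembled before a run, the helper below builds an input dictionary. The key names (`text`, `reference`, `alpha`, `beta`, `embedding_scale`), the 0 to 1 range for alpha and beta, and the default values are assumptions made for illustration. Check the model's input schema before relying on them. The helper `build_styletts2_input` is hypothetical and makes no network call.

```python
from typing import Optional


def build_styletts2_input(
    text: str,
    alpha: float = 0.3,            # assumed: timbre control, 0..1
    beta: float = 0.7,             # assumed: prosody control, 0..1
    embedding_scale: float = 1.0,  # assumed: higher means more emotional intensity
    reference: Optional[str] = None,  # assumed: URL or path of a reference clip
) -> dict:
    """Validate the style settings and return an input payload (hypothetical schema)."""
    if not text.strip():
        raise ValueError("text must not be empty")
    for name, value in (("alpha", alpha), ("beta", beta)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    if embedding_scale <= 0:
        raise ValueError("embedding_scale must be positive")

    payload = {
        "text": text,
        "alpha": alpha,
        "beta": beta,
        "embedding_scale": embedding_scale,
    }
    # Only include the reference clip when one is supplied.
    if reference is not None:
        payload["reference"] = reference
    return payload


if __name__ == "__main__":
    print(build_styletts2_input("Hello there!", embedding_scale=1.5))
```

A dictionary like this would typically be passed as the `input` argument when running the model through a client library. Validating up front catches out-of-range values before a paid run starts.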

styletts2 is a hair slower than xtts-v2 but a bit cheaper per run. The main downside is that it only supports English. But if you’re working in English and want maximum customization, it’s the way to go.

Best for Expressive Speech: Bark

Looking to generate dynamic speech with lots of variation and personality? Bark has you covered. Its specialty is expressive speech synthesis with a wide range of voices and styles.

Bark shines for creative use cases like generating realistic dialogue, characters, and even sound effects. With over 100 voices spanning different languages, genders, and tones, it offers unmatched diversity. You can also clone your own voice for even more options.

The tradeoff is that Bark is slower and pricier than xtts-v2 or styletts2. It’s also trickier to control the output for a consistent voice. But when you need the most natural and expressive speech possible, Bark is in a league of its own.

For Singing Voice Conversion: RVC

RVC is a unique offering purpose-built for “singing voice conversion”. It lets you take an existing song and modify the vocals to sound like a different singer.

While not suited for standard text-to-speech, RVC is impressive for its specialized use case. It comes with a variety of built-in voices to choose from (Squidward, Trump, Drake, etc.). You can tweak settings like pitch, volume, and reverb to dial in the effect.

RVC won’t be the right tool for everyone. But if you want to create convincing song covers or mashups, it’s a powerful option to have in your toolkit.

Recommended models

minimax / speech-02-turbo

Text-to-Audio (T2A) model that offers voice synthesis, emotional expression, and multilingual capabilities. Designed for real-time applications with low latency.

3.8K runs

minimax / voice-cloning

Clone voices to use with Minimax's speech-02-hd and speech-02-turbo

651 runs

cjwbw / voicecraft

Zero-Shot Speech Editing and Text-to-Speech in the Wild

10.2K runs

fermatresearch / spanish-f5-tts

An F5-TTS model fine-tuned for Spanish

500 runs

x-lance / f5-tts

F5-TTS, the new state-of-the-art in open source voice cloning

23.5K runs

platform-kit / mars5-tts

A novel speech model for insane prosody.

469 runs

chenxwh / openvoice

Updated to OpenVoice v2: Versatile Instant Voice Cloning

57.7K runs

declare-lab / tango

Tango 2: Use text prompts to make sound effects

28.4K runs

cjwbw / parler-tts

Lightweight text-to-speech (TTS) model, trained on 10.5K hours of audio data

2.5K runs

camenduru / metavoice

MetaVoice-1B: 1.2B parameter base model trained on 100K hours of speech

12.1K runs

adirik / styletts2

Generates speech from text

130.9K runs

lucataco / pheme

Pheme generates a variety of conversational voices in 16 kHz for phone-call applications

519 runs

zsxkib / realistic-voice-cloning

Create song covers with any RVC v2 trained AI voice from audio files.

706.2K runs

cjwbw / seamless_communication

SeamlessM4T—Massively Multilingual & Multimodal Machine Translation

82.6K runs

awerks / neon-tts

NeonAI Coqui AI TTS Plugin.

120.9K runs

suno-ai / bark

🔊 Text-Prompted Generative Audio Model

298.6K runs

afiaka87 / tortoise-tts

Generate speech from text, clone voices from mp3 files. From James Betker AKA "neonbjb".

170.1K runs