Generate natural-sounding speech from text. Clone voices, control emotions, and produce audio in dozens of languages.
Inworld Realtime TTS 2.0 is the most expressive TTS model on Replicate. Direct any voice with bracketed natural-language cues like [say excitedly], [whisper in a hushed style], or [speak as if barely holding back rage] — no preset list, just write your directions like you're directing a voice actor. Real-time latency, 15 production languages plus experimental support for 90+ more, and inline non-verbals like [laugh], [sigh], and [breathe].
MiniMax Speech 2.8 HD ranks #1 on TTS benchmarks, outperforming both OpenAI and ElevenLabs in blind evaluations. Studio-grade voice synthesis with 17+ preset voices, emotion control, voice cloning from just 5 seconds of audio, and support for 32+ languages. The best choice for voiceovers, audiobooks, and polished content.
ElevenLabs v3 supports audio tags like [excited], [whispers], and [sighs] for fine-grained delivery control. Supports 70+ languages and 26 voices. Great for film, audiobooks, and creative media when you want a curated set of expressive tags rather than free-form direction.
Gemini 3.1 Flash TTS from Google gives you fine-grained control over delivery through inline tags and style prompting. Set a scene, define a character, and direct the performance — "you must hear the grin in the audio." 30 voices, 70+ languages, and natural-sounding output with rich expressiveness.
MiniMax Speech 2.8 Turbo is optimized for low-latency applications like voice agents, chatbots, and interactive experiences. Supports 40+ languages with the same voice cloning and emotion control as the HD version.
Inworld Realtime TTS 1.5 Mini achieves ~120ms latency — the fastest in this collection. Supports 15 languages with emotion markups and SSML break tags. Inworld Realtime TTS 1.5 Max trades a bit of speed for higher quality at <200ms latency.
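For a feel of what SSML break tags look like in practice, here is a minimal sketch that joins sentences with a pause between them. It uses the standard SSML `<break time="..."/>` form. Whether the Inworld 1.5 models accept exactly this syntax is an assumption to verify on the model page, and `join_with_breaks` is a hypothetical helper name, not part of any SDK.

```python
from typing import Iterable


def join_with_breaks(sentences: Iterable[str], pause_ms: int = 400) -> str:
    """Join sentences with a standard SSML break tag between them."""
    if pause_ms < 0:
        raise ValueError("pause_ms must be non-negative")
    # Standard SSML syntax; assumed (not confirmed) to match what the model accepts.
    tag = f'<break time="{pause_ms}ms"/>'
    cleaned = [s.strip() for s in sentences if s.strip()]
    return f" {tag} ".join(cleaned)


text = join_with_breaks(
    ["Your order has shipped.", "It should arrive on Friday."], 500
)
print(text)
# Your order has shipped. <break time="500ms"/> It should arrive on Friday.
```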
Chatterbox from Resemble AI excels at voice cloning with emotional control — generate distinct character voices from just a few seconds of reference audio. Great for games, animations, and storytelling.
ElevenLabs v2 Multilingual generates speech in 29 languages while maintaining consistent voice quality across all of them. Good for localization workflows where the same voice needs to work in multiple languages.
Tortoise TTS is an open-source option that produces high-quality speech. Slower than the commercial models but fully self-hostable.
Featured models

Most expressive text-to-speech model from Inworld, with natural-language steering, real-time latency, and multilingual support across 100+ languages.
Updated 1 week ago
1.1K runs

MiniMax Speech 2.8 HD focuses on high-fidelity audio generation, with studio-grade quality, flexible emotion control, multilingual support, and voice cloning.
Updated 3 weeks, 1 day ago
87.6K runs

MiniMax Speech 2.8 Turbo: Turn text into natural, expressive speech with voice cloning, emotion control, and support for 40+ languages
Updated 3 weeks, 1 day ago
151.5K runs

Google's fast, expressive text-to-speech model with 30 voices and 70+ language support
Updated 3 weeks, 4 days ago
53.3K runs

Highest-quality realtime text-to-speech with <200ms latency, emotion control, and 15-language support
Updated 3 weeks, 6 days ago
101.4K runs

Ultra-fast, cost-efficient realtime text-to-speech with ~120ms latency and 15-language support
Updated 3 weeks, 6 days ago
36.5K runs

The fastest open source TTS model without sacrificing quality.
Updated 4 months, 4 weeks ago
339.6K runs

The most expressive Text to Speech model
Updated 6 months, 2 weeks ago
40.3K runs

Generate expressive, natural speech. Features unique emotion control, instant voice cloning from short audio, and built-in watermarking.
Updated 10 months, 3 weeks ago
289.4K runs
Recommended Models
For maximum expressiveness and control, try inworld/realtime-tts-2 — you can direct the voice with natural-language cues like [say excitedly] or [whisper in a hushed style]. For polished, studio-grade audio, minimax/speech-2.8-hd ranks #1 on benchmarks and supports 32+ languages with voice cloning and emotion control.
inworld/realtime-tts-2 supports free-form natural-language steering — you write directions like you're directing a voice actor. For example: [overwhelmed with excitement and barely able to contain yourself] We just hit a million users. elevenlabs/v3 takes a different approach with curated audio tags like [excited], [whispers], and [sighs].
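The bracketed-direction format above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the model's documented interface: `with_direction` and `synthesize` are hypothetical helpers, and the `"text"` input key is a guess that should be checked against the model's schema on Replicate. The `replicate.run` call itself is the standard entry point of the Replicate Python client and needs `REPLICATE_API_TOKEN` set.

```python
def with_direction(direction: str, text: str) -> str:
    """Prefix text with a bracketed natural-language cue."""
    # Accept the cue with or without its brackets.
    direction = direction.strip().strip("[]").strip()
    if not direction:
        return text
    return f"[{direction}] {text}"


def synthesize(prompt: str, model: str = "inworld/realtime-tts-2"):
    """Send the prompt to Replicate (requires network access and an API token)."""
    # Imported lazily so with_direction works without the client installed.
    import replicate

    # ASSUMPTION: the model takes its script under the "text" key.
    return replicate.run(model, input={"text": prompt})


prompt = with_direction(
    "overwhelmed with excitement and barely able to contain yourself",
    "We just hit a million users.",
)
print(prompt)
```

Keeping prompt construction separate from the network call makes the directions easy to unit test and reuse across models that share the bracket convention.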
inworld/realtime-tts-1.5-mini achieves ~120ms latency — the fastest in this collection. inworld/realtime-tts-2 and minimax/speech-2.8-turbo are also designed for low-latency real-time use. Great for chatbots, voice agents, and interactive apps.
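The latency figures quoted here come from the model descriptions, so it can be worth measuring what your own application sees. A minimal timing sketch follows. `timed` is a hypothetical helper, and `fake_synthesize` is a stand-in for a real model call so the example runs offline. Wrapping a real request this way measures the full round trip, which may include queueing and network time on top of the model's own latency.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def timed(call: Callable[[], T]) -> Tuple[T, float]:
    """Run call() and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = call()
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms


def fake_synthesize() -> bytes:
    """Stand-in for a TTS request; sleeps to simulate work."""
    time.sleep(0.05)
    return b"audio"


audio, ms = timed(fake_synthesize)
budget_ms = 200.0  # example budget, chosen for illustration only
print(f"{ms:.0f} ms, within budget: {ms <= budget_ms}")
```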
minimax/speech-2.8-hd and minimax/speech-2.8-turbo both support voice cloning from just 5 seconds of reference audio. resemble-ai/chatterbox is another option with emotional control, especially good for character voices in games and animation. The Inworld models also support custom cloned voice IDs created on the Inworld platform.
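Before uploading a reference clip for cloning, it can help to confirm it meets the roughly 5-second figure mentioned above. The sketch below checks a WAV file's duration with Python's standard `wave` module. It only handles uncompressed WAV, the helper names are hypothetical, and treating 5 seconds as a hard minimum is an assumption about how the models behave.

```python
import os
import tempfile
import wave

MIN_REFERENCE_SECONDS = 5.0  # figure quoted for the MiniMax models


def wav_duration_seconds(path: str) -> float:
    """Return the duration of an uncompressed WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / float(wav.getframerate())


def is_long_enough(path: str, minimum: float = MIN_REFERENCE_SECONDS) -> bool:
    """True when the clip is at least `minimum` seconds long."""
    return wav_duration_seconds(path) >= minimum


def write_silence(path: str, seconds: float, rate: int = 8000) -> None:
    """Demo helper: write mono 16-bit silence so the check can be tried offline."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))


with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "reference.wav")
    write_silence(sample, 6.0)
    print(wav_duration_seconds(sample), is_long_enough(sample))
```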
elevenlabs/v3 supports 70+ languages. inworld/realtime-tts-2 supports 15 production languages plus experimental support for 90+ more. minimax/speech-2.8-turbo and minimax/speech-2.8-hd support 40+ and 32+ languages respectively. elevenlabs/v2-multilingual supports 29 languages with consistent voice quality across all of them.
Most modern TTS models support emotion control, but they take different approaches. inworld/realtime-tts-2 lets you write free-form natural-language directions like [say sadly with deliberate pauses in a low voice]. elevenlabs/v3 uses curated audio tags. MiniMax models support presets like happy, sad, angry, fearful, and calm. The Inworld 1.5 models support emotion markups like [happy] and [sad], plus non-verbal sounds like [laugh] and [sigh].
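The practical difference between these approaches is where the emotion goes: inline in the script for the Inworld and ElevenLabs models, or as a separate preset for MiniMax. A minimal sketch of that dispatch follows. `build_input` is a hypothetical helper, the `"text"` and `"emotion"` keys are assumptions about each model's input schema, and the preset set contains only the values named above, which may not be exhaustive.

```python
# Presets named in the text above; the real list may be longer.
MINIMAX_PRESETS = {"happy", "sad", "angry", "fearful", "calm"}


def build_input(model: str, text: str, style: str) -> dict:
    """Build a request payload that carries `style` the way each model family expects."""
    if model.startswith("minimax/"):
        # Preset approach: emotion travels as its own field (assumed key name).
        if style not in MINIMAX_PRESETS:
            raise ValueError(f"unknown MiniMax preset: {style!r}")
        return {"text": text, "emotion": style}
    # Inline approach: a free-form direction (inworld/realtime-tts-2) or a
    # curated tag (elevenlabs/v3, Inworld 1.5) is bracketed into the script.
    return {"text": f"[{style}] {text}"}


print(build_input("minimax/speech-2.8-hd", "See you tomorrow.", "calm"))
print(build_input("elevenlabs/v3", "See you tomorrow.", "excited"))
print(build_input(
    "inworld/realtime-tts-2",
    "See you tomorrow.",
    "say sadly with deliberate pauses in a low voice",
))
```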
afiaka87/tortoise-tts is open-source and produces high-quality speech. It's slower than commercial models but can be self-hosted on your own hardware.
Most models support commercial use. Some may include audio watermarking — check each model's license page for specifics, especially regarding voice cloning and redistribution.
Recommended Models

A unified Text-to-Speech demo featuring three powerful modes: Voice, Clone and Design
Updated 4 days, 18 hours ago
450.2K runs

Text-to-Audio (T2A) that offers voice synthesis, emotional expression, and multilingual capabilities. Optimized for high-fidelity applications like voiceovers and audiobooks.
Updated 3 weeks, 1 day ago
2.3M runs

MiniMax Speech 2.6 HD delivers studio-quality multilingual text-to-audio on Replicate with nuanced prosody, subtitle export, and premium voices
Updated 3 weeks, 1 day ago
181.7K runs

Text-to-Audio (T2A) that offers voice synthesis, emotional expression, and multilingual capabilities. Designed for real-time applications with low latency
Updated 3 weeks, 1 day ago
12.3M runs

Low-latency MiniMax Speech 2.6 Turbo brings multilingual, emotional text-to-speech to Replicate with 300+ voices and real-time-friendly pricing
Updated 3 weeks, 1 day ago
870.3K runs

Clone voices to use with MiniMax's speech-02-hd and speech-02-turbo
Updated 6 months ago
64.3K runs

High quality, low latency text to speech in 32 languages
Updated 6 months, 2 weeks ago
30.4K runs

Generate multilingual text-to-speech audio in over 30 languages
Updated 6 months, 2 weeks ago
11.1K runs

ElevenLabs's fastest speech synthesis model
Updated 6 months, 2 weeks ago
28.6K runs

Generate expressive, natural speech in 23 languages. Features instant voice cloning from short audio, emotion control, and seamless cross-language voice transfer.
Updated 8 months, 1 week ago
72.7K runs

zsxkib/dia
Dia 1.6B by Nari Labs. Generates realistic dialogue audio from text, including non-verbal cues and voice cloning
Updated 9 months, 4 weeks ago
14.6K runs

Generate expressive, natural speech with Resemble AI's Chatterbox.
Updated 10 months, 3 weeks ago
19K runs

lucataco/csm-1b
CSM (Conversational Speech Model) is a speech generation model from Sesame that generates RVQ audio codes from text and audio inputs
Updated 1 year, 1 month ago
1.2K runs

lucataco/orpheus-3b-0.1-ft
Orpheus 3B - high quality, emotive Text to Speech
Updated 1 year, 1 month ago
35.4K runs

cjwbw/voicecraft
Zero-Shot Speech Editing and Text-to-Speech in the Wild
Updated 1 year, 1 month ago
10.9K runs

jaaari/kokoro-82m
Kokoro v1.0 - text-to-speech (82M params, based on StyleTTS2)
Updated 1 year, 3 months ago
91.2M runs

An F5-TTS model fine-tuned for Spanish
Updated 1 year, 6 months ago
1.7K runs

F5-TTS, the new state-of-the-art in open source voice cloning
Updated 1 year, 6 months ago
44.2K runs

platform-kit/mars5-tts
A novel speech model for insane prosody.
Updated 1 year, 10 months ago
547 runs

chenxwh/openvoice
Updated to OpenVoice v2: Versatile Instant Voice Cloning
Updated 1 year, 11 months ago
86.3K runs

cjwbw/parler-tts
Lightweight text-to-speech (TTS) model, trained on 10.5K hours of audio data
Updated 2 years ago
2.8K runs

adirik/styletts2
Generates speech from text
Updated 2 years, 3 months ago
132.5K runs

lucataco/pheme
Pheme generates a variety of conversational voices in 16 kHz for phone-call applications
Updated 2 years, 4 months ago
583 runs

lucataco/xtts-v2
Coqui XTTS-v2: Multilingual Text To Speech Voice Cloning
Updated 2 years, 5 months ago
6.4M runs

zsxkib/realistic-voice-cloning
Create song covers with any RVC v2 trained AI voice from audio files.
Updated 2 years, 5 months ago
1.8M runs

cjwbw/seamless_communication
SeamlessM4T: Massively Multilingual & Multimodal Machine Translation
Updated 2 years, 7 months ago
110K runs

awerks/neon-tts
NeonAI Coqui AI TTS Plugin.
Updated 2 years, 9 months ago
204.8K runs

suno-ai/bark
🔊 Text-Prompted Generative Audio Model
Updated 3 years ago
307.7K runs

afiaka87/tortoise-tts
Generate speech from text, clone voices from mp3 files. From James Betker AKA "neonbjb".
Updated 3 years, 9 months ago
173.6K runs