Readme
Inworld TTS 1.5 Mini is Inworld’s ultra-fast, most cost-efficient text-to-speech model. With ~120ms median latency and support for 15 languages, it’s built for real-time conversational AI at scale.
Ranked #1 on Artificial Analysis, Inworld TTS delivers natural, expressive speech at a fraction of the cost of alternatives.
Key features
- ~120ms median latency: Built for real-time conversation and voice agents
- 15 languages: English, Chinese, Japanese, Korean, Russian, Italian, Spanish, Portuguese, French, German, Polish, Dutch, Hindi, Hebrew, and Arabic
- Emotion control: Add emotion markups like `[happy]`, `[sad]`, `[angry]` to control delivery
- Non-verbal sounds: Insert `[laugh]`, `[sigh]`, `[cough]` and other vocalizations
- SSML pauses: Use `<break time="1s" />` to insert natural pauses
- Voice cloning: Use preset voices or bring your own cloned voice ID
- Multiple formats: MP3, WAV, OGG Opus, and FLAC output
Preset voices
| Voice | Description |
|---|---|
| Ashley | A warm, natural female voice |
| Dennis | Middle-aged man with a smooth, calm and friendly voice |
| Alex | Energetic and expressive mid-range male voice, with a mildly nasal quality |
| Darlene | Soothing, comforting Southern female voice, ideal for bedtime stories and narrations |
You can also use custom cloned voice IDs from the Inworld platform. To browse all available voices, use the List Voices API or the TTS Playground.
Audio markups
The model supports rich text markups for expressive speech:
- Emotions: `[happy]`, `[sad]`, `[angry]`, `[surprised]`, `[fearful]`, `[disgusted]`
- Delivery styles: `[laughing]`, `[whispering]`
- Non-verbal sounds: `[breathe]`, `[clear_throat]`, `[cough]`, `[laugh]`, `[sigh]`, `[yawn]`
- Pauses: `<break time="1s" />`, `<break time="500ms" />`
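As an illustration, the markups listed above can be assembled programmatically before the text is sent for synthesis. The following is a minimal Python sketch, not part of any Inworld SDK: the helper names `tag`, `pause`, and `unknown_tags` are hypothetical, and where each tag is placed within the sentence is an assumption, since this README does not specify placement rules.

```python
import re

# Tag names copied from the markup list in this README; anything else is rejected.
EMOTIONS = {"happy", "sad", "angry", "surprised", "fearful", "disgusted"}
STYLES = {"laughing", "whispering"}
NON_VERBAL = {"breathe", "clear_throat", "cough", "laugh", "sigh", "yawn"}
KNOWN_TAGS = EMOTIONS | STYLES | NON_VERBAL


def tag(name: str) -> str:
    """Return a bracketed markup such as "[happy]", rejecting unknown names."""
    if name not in KNOWN_TAGS:
        raise ValueError(f"unknown markup: {name!r}")
    return f"[{name}]"


def pause(milliseconds: int) -> str:
    """Return an SSML break tag, written in whole seconds when possible."""
    if milliseconds <= 0:
        raise ValueError("pause must be positive")
    if milliseconds % 1000 == 0:
        duration = f"{milliseconds // 1000}s"
    else:
        duration = f"{milliseconds}ms"
    return f'<break time="{duration}" />'


def unknown_tags(text: str) -> list[str]:
    """List bracketed names in text that are not in KNOWN_TAGS."""
    found = re.findall(r"\[([a-z_]+)\]", text)
    return [name for name in found if name not in KNOWN_TAGS]


# Hypothetical input text; tag placement here is illustrative only.
line = f"{tag('happy')} We shipped it! {pause(500)} {tag('sigh')} Finally."
print(line)
```

Running `unknown_tags` over hand-written text before submission is one way to catch typos such as `[excited]`, which is not in the list above.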
Choosing between Inworld TTS models
- TTS 1.5 Mini: Ultra-fast (~120ms median latency) and the most cost-efficient option; best for high-volume, latency-sensitive applications
- TTS 1.5 Max: Best balance of quality and speed (<200ms latency); choose it for applications where voice quality is the top priority
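The trade-off above can be expressed as a simple selection rule. This is a rough Python sketch under stated assumptions: `pick_model` is a hypothetical helper, it returns the display names used in this README rather than real API model identifiers, and it treats the quoted latency figures as fixed thresholds even though actual latency will vary with network and load.

```python
# Latency figures quoted in this README; used here as a threshold only.
MAX_MODEL_LATENCY_MS = 200  # "<200ms" for TTS 1.5 Max


def pick_model(latency_budget_ms: int, quality_first: bool) -> str:
    """Pick a model display name from a latency budget and a quality preference."""
    # Choose Max only when quality matters most and the budget leaves room for it.
    if quality_first and latency_budget_ms >= MAX_MODEL_LATENCY_MS:
        return "TTS 1.5 Max"
    # Otherwise fall back to the faster, cheaper model.
    return "TTS 1.5 Mini"


print(pick_model(latency_budget_ms=300, quality_first=True))
print(pick_model(latency_budget_ms=150, quality_first=True))
```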