inworld/tts-1.5-mini

Ultra-fast, cost-efficient text-to-speech with ~120ms latency and 15-language support

Inworld TTS 1.5 Mini is Inworld’s ultra-fast, most cost-efficient text-to-speech model. With ~120ms median latency and support for 15 languages, it’s built for real-time conversational AI at scale.

Ranked #1 on Artificial Analysis, Inworld TTS delivers natural, expressive speech at a fraction of the cost of alternatives.

Key features

  • ~120ms median latency: Built for real-time conversation and voice agents
  • 15 languages: English, Chinese, Japanese, Korean, Russian, Italian, Spanish, Portuguese, French, German, Polish, Dutch, Hindi, Hebrew, and Arabic
  • Emotion control: Add emotion markups like [happy], [sad], [angry] to control delivery
  • Non-verbal sounds: Insert [laugh], [sigh], [cough] and other vocalizations
  • SSML pauses: Use <break time="1s" /> to insert natural pauses
  • Voice cloning: Use preset voices or bring your own cloned voice ID
  • Multiple formats: MP3, WAV, OGG Opus, and FLAC output

Preset voices

  • Ashley: A warm, natural female voice
  • Dennis: Middle-aged man with a smooth, calm and friendly voice
  • Alex: Energetic and expressive mid-range male voice, with a mildly nasal quality
  • Darlene: Soothing, comforting Southern female voice, ideal for bedtime stories and narrations

You can also use custom cloned voice IDs from the Inworld platform. To browse all available voices, use the List Voices API or the TTS Playground.
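
As a rough illustration of selecting a voice and an output format, the sketch below builds a request payload and sends it with the Replicate Python client. It assumes the model is served on Replicate under the slug in the title. The input keys (text, voice_id, audio_format) and the format spellings are hypothetical and are not taken from the model's actual schema, so check them before use.

```python
# Preset voices and output formats as listed in this document.
PRESET_VOICES = {"Ashley", "Dennis", "Alex", "Darlene"}
AUDIO_FORMATS = {"mp3", "wav", "ogg_opus", "flac"}  # hypothetical spellings


def is_preset(voice):
    """True for one of the four preset voices, False for anything else."""
    return voice in PRESET_VOICES


def build_input(text, voice="Ashley", audio_format="mp3"):
    """Return a request payload. The key names are assumptions."""
    if not text.strip():
        raise ValueError("text must not be empty")
    if audio_format not in AUDIO_FORMATS:
        raise ValueError(f"unsupported format: {audio_format}")
    # Voices outside the preset list are passed through unchanged,
    # since they may be custom cloned voice IDs.
    return {"text": text, "voice_id": voice, "audio_format": audio_format}


def synthesize(text, **options):
    """Send the payload with the Replicate client (needs REPLICATE_API_TOKEN)."""
    import replicate  # imported lazily so build_input works without the package

    return replicate.run("inworld/tts-1.5-mini", input=build_input(text, **options))
```

Calling synthesize("Hello there", voice="Dennis", audio_format="wav") would submit the request. Keeping build_input separate makes it possible to check the payload without network access.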

Audio markups

The model supports rich text markups for expressive speech:

  • Emotions: [happy], [sad], [angry], [surprised], [fearful], [disgusted]
  • Delivery styles: [laughing], [whispering]
  • Non-verbal sounds: [breathe], [clear_throat], [cough], [laugh], [sigh], [yawn]
  • Pauses: <break time="1s" />, <break time="500ms" />
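
The markups above are plain text, so they can be checked before a request is sent. The following is a minimal sketch, not part of any Inworld SDK. The helper names are hypothetical, and the tag sets are copied from the lists in this document, which may not be exhaustive.

```python
import re

# Tag names copied from the lists above.
EMOTIONS = {"happy", "sad", "angry", "surprised", "fearful", "disgusted"}
STYLES = {"laughing", "whispering"}
SOUNDS = {"breathe", "clear_throat", "cough", "laugh", "sigh", "yawn"}
KNOWN_TAGS = EMOTIONS | STYLES | SOUNDS

TAG_RE = re.compile(r"\[([^\[\]]+)\]")
BREAK_RE = re.compile(r'<break time="(\d+)(ms|s)" />')


def pause(ms):
    """Format a pause tag in the form shown in this document."""
    return f'<break time="{ms}ms" />'


def unknown_tags(text):
    """Return bracketed tags that are not in the lists above, in order."""
    return [tag for tag in TAG_RE.findall(text) if tag not in KNOWN_TAGS]


def total_pause_ms(text):
    """Sum the durations of all <break> tags, in milliseconds."""
    return sum(
        int(amount) * (1000 if unit == "s" else 1)
        for amount, unit in BREAK_RE.findall(text)
    )


line = "[happy] Great to see you! " + pause(500) + " [sigh] Long day. [giggle]"
print(unknown_tags(line))    # ['giggle']
print(total_pause_ms(line))  # 500
```

A check like this catches misspelled tags early. Whether the model reads an unrecognized tag aloud or ignores it is not stated here.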

Choosing between Inworld TTS models

  • TTS 1.5 Mini: Ultra-fast (~120ms) and the most cost-efficient option; best for high-volume, latency-sensitive applications
  • TTS 1.5 Max: Higher voice quality at slightly higher latency (<200ms); best for applications where voice quality is the top priority
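
The guidance above can be turned into a simple selection rule. This is only a sketch: the function name is hypothetical, the Max slug is assumed by analogy with the Mini slug, and the 200 ms threshold is taken from the latency figure quoted above, not from a measured guarantee.

```python
def pick_model(latency_budget_ms, quality_first=False):
    """Pick a model slug from the guidance above. The Max slug is hypothetical."""
    # Max is described as under 200 ms, so choose it only when quality is
    # the priority and the budget leaves room for that latency.
    if quality_first and latency_budget_ms >= 200:
        return "inworld/tts-1.5-max"
    # Otherwise fall back to the faster, cheaper model.
    return "inworld/tts-1.5-mini"
```

For example, a voice agent with a 150 ms budget gets Mini even when quality_first is set, while an offline narration job with a loose budget and quality_first=True gets Max.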