inworld/realtime-tts-2

Most expressive text-to-speech model from Inworld, with natural-language steering, real-time latency, and multilingual support across 100+ languages.

Inworld Realtime TTS 2.0 (inworld-tts-2) is Inworld’s most powerful and expressive text-to-speech model. It keeps the rich, expressive speech and real-time latency of Realtime TTS 1.5 and adds natural-language steering and stronger multilingual capabilities.

What’s new in 2.0

  • Natural-language steering — Direct any voice with bracketed natural-language cues like [say excitedly], [whisper in a hushed style], or [say sadly with deliberate pauses in a low voice]. No preset list — just write your directions in natural language.
  • Stronger multilingual capabilities — More natural, native speech across 15 production languages, plus experimental support for 90+ additional languages.
  • Improved alphanumerics — Better support for phone numbers, SKUs, and codes.
  • Quality — Builds on the #1-ranked model on Artificial Analysis with improved word error rate, sharper alignment, and more natural speech.

Steering with natural language

Wrap your direction in square brackets and place it before the text it applies to. Think of it as giving direction to a voice actor.

Free-form direction

[speak as if barely holding back rage — forcing every word through gritted teeth] I have told you. Repeatedly. And you STILL didn't listen.

[overwhelmed with excitement and barely able to contain yourself] We just hit a million users. I still can't believe it!

[slow and hushed with every word weighted by grief] I got the call this morning. He's gone.
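The bracket-before-text convention can be applied programmatically when prompts are assembled in code. The following is a minimal sketch; the helper name steer is hypothetical and is not part of any Inworld SDK. It only builds the input string.

```python
def steer(direction: str, text: str) -> str:
    """Prefix `text` with a bracketed steering direction.

    Hypothetical helper: it only formats the prompt string and
    does not call any TTS service.
    """
    # Accept directions with or without surrounding brackets.
    direction = direction.strip().strip("[]").strip()
    if not direction:
        return text  # nothing to steer with, return the text unchanged
    return f"[{direction}] {text}"


prompt = steer(
    "slow and hushed with every word weighted by grief",
    "I got the call this morning. He's gone.",
)
print(prompt)
```

The resulting string is what would be sent as the text to synthesize.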

Single-property direction

Quality       Examples
Articulation  [say with force], [articulate clearly], [say with deliberate pauses]
Intonation    [say with a falling pitch], [say with a rising pitch]
Volume        [very loud], [very quiet]
Pitch         [say in a low tone], [say in a high pitch]
Range         [say playfully], [say with no pitch variation]
Speed         [very fast], [very slow]
Vocal style   [whisper in a hushed style], [sing joyfully], [give a nasal quality]

The more you describe how you want the voice to perform, the better the output. A bare tag like [sad] gives the model one dimension to work with. A fuller instruction like [say sadly with deliberate pauses in a low voice and hushed style] combines mood, rhythm, pitch, and vocal style, which produces a more nuanced and convincing performance.
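One way to build fuller instructions consistently is to join several single-property phrases into one tag. This is a sketch under that assumption; compose_direction is a hypothetical helper, and the phrases are taken from the examples above.

```python
def compose_direction(mood: str, *qualities: str) -> str:
    """Join a mood and any extra delivery phrases into one bracketed tag.

    Hypothetical helper for illustration; phrases are joined with
    single spaces and empty phrases are skipped.
    """
    parts = [mood.strip()] + [q.strip() for q in qualities if q.strip()]
    return "[" + " ".join(parts) + "]"


# One dimension only: mood.
bare = compose_direction("sad")

# Mood plus rhythm, pitch, and vocal style.
full = compose_direction(
    "say sadly",
    "with deliberate pauses",
    "in a low voice",
    "and hushed style",
)
print(bare)
print(full)
```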

Inline non-verbals

Insert organic, human sounds at any point in the text:

[laugh] [breathe] [clear throat] [sigh] [cough] [yawn]

I told him what happened, and he just [laugh] couldn't believe it!
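If text is generated or templated, a small helper can place a non-verbal tag at a chosen point and reject sounds outside the list above. This is a sketch; insert_non_verbal is a hypothetical name, and restricting input to the six listed sounds is an assumption of this example, not a documented limit of the model.

```python
# The six non-verbal sounds listed in this section.
NON_VERBALS = {"laugh", "breathe", "clear throat", "sigh", "cough", "yawn"}


def insert_non_verbal(text: str, after: str, sound: str) -> str:
    """Insert `[sound]` right after the first occurrence of `after`."""
    if sound not in NON_VERBALS:
        raise ValueError(f"unknown non-verbal: {sound!r}")
    idx = text.find(after)
    if idx == -1:
        raise ValueError(f"anchor phrase not found: {after!r}")
    cut = idx + len(after)
    head, tail = text[:cut].rstrip(), text[cut:].lstrip()
    # Skip empty pieces so no stray spaces appear at either end.
    return " ".join(part for part in (head, f"[{sound}]", tail) if part)


line = insert_non_verbal(
    "I told him what happened, and he just couldn't believe it!",
    after="he just",
    sound="laugh",
)
print(line)
```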

Emphasis

Capitalize letters within your input text to draw attention to specific words or syllables.

I told you NOT to open that door.

Are you seriously asking if I want pizza? AbsoLUTEly I do.
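Capitalization can also be applied in code before the text is sent. The sketch below shows two hypothetical helpers: one capitalizes a whole word, the other capitalizes part of a word to stress a syllable. Both only transform the string.

```python
import re


def emphasize(text: str, word: str) -> str:
    """Uppercase the first whole-word occurrence of `word` (case-insensitive)."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", flags=re.IGNORECASE)
    return pattern.sub(lambda m: m.group(0).upper(), text, count=1)


def emphasize_part(text: str, word: str, part: str) -> str:
    """Uppercase `part` inside the first occurrence of `word`."""
    stressed = word.replace(part, part.upper(), 1)
    return text.replace(word, stressed, 1)


print(emphasize("I told you not to open that door.", "not"))
print(emphasize_part("Absolutely I do.", "Absolutely", "lute"))
```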

Languages

Production languages (best quality): English, Chinese, Japanese, Korean, Russian, Italian, Spanish, Portuguese, French, German, Polish, Dutch, Hindi, Hebrew, Arabic.

Experimental support for 90+ additional languages including Vietnamese, Thai, Turkish, Indonesian, Swedish, and many more. See the language support docs for the full list.

Set language to auto (the default) to let the model detect the language, or specify the code of one of the production languages for best results.

Preset voices

Voice    Description
Ashley   A warm, natural female voice
Dennis   Middle-aged man with a smooth, calm and friendly voice
Alex     Energetic and expressive mid-range male voice, with a mildly nasal quality
Darlene  Soothing, comforting Southern female voice, ideal for bedtime stories and narrations

You can also use custom cloned voice IDs from the Inworld platform. To browse all voices, use the List Voices API or the TTS Playground.
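As an illustration of how voice and language fit together in a request, the sketch below assembles an input payload. The field names text and voice are assumptions made for this example (only language is named on this page), and build_input is a hypothetical helper; check the model's input schema for the actual parameter names before use.

```python
# Preset voices listed on this page. Custom cloned voice IDs are also
# allowed, so membership in this set is informational only.
PRESET_VOICES = {"Ashley", "Dennis", "Alex", "Darlene"}


def build_input(text: str, voice: str = "Ashley", language: str = "auto") -> dict:
    """Assemble a request payload.

    ASSUMPTION: the keys "text" and "voice" are hypothetical field names;
    "language" and its "auto" default come from this page.
    """
    if not text.strip():
        raise ValueError("text must not be empty")
    return {"text": text, "voice": voice, "language": language}


payload = build_input("[say excitedly] We just hit a million users!", voice="Alex")
print(payload)
print("preset voice:", payload["voice"] in PRESET_VOICES)
```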

Best practices

  • Match the instruction to the text. Mismatches like [say sadly] on celebratory text degrade output quality.
  • Avoid conflicting instructions. Combining opposing directions on the same text (e.g. [whisper in a hushed style] and [very loud]) produces unpredictable results.
  • Place steering tags before the text they apply to. Non-verbal tags like [laugh] are the exception — they can be inserted inline where the sound should occur.
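Two of these practices can be checked mechanically before a request is sent. The sketch below is a rough heuristic, not an official validator: lint is a hypothetical helper, and the quiet and loud cue lists are assumptions drawn from the examples on this page. It flags opposing volume cues that apply to the same stretch of text and steering tags with no text after them. Matching the mood of an instruction to the text is left to the author.

```python
import re

TAG = re.compile(r"\[([^\[\]]+)\]")
NON_VERBALS = {"laugh", "breathe", "clear throat", "sigh", "cough", "yawn"}
# ASSUMPTION: illustrative cue lists based on examples from this page.
QUIET_CUES = ("whisper", "hushed", "very quiet")
LOUD_CUES = ("very loud",)


def lint(text: str) -> list[str]:
    """Return warnings for conflicting or misplaced steering tags."""
    warnings = []
    group = []  # steering directions that precede the same stretch of text
    pos = 0
    for m in TAG.finditer(text):
        direction = m.group(1).strip().lower()
        if text[pos:m.start()].strip():
            group = []  # spoken text since the last tag starts a new group
        pos = m.end()
        if direction in NON_VERBALS:
            continue  # non-verbals may appear anywhere inline
        group.append(direction)
        joined = " ".join(group)
        if any(c in joined for c in QUIET_CUES) and any(c in joined for c in LOUD_CUES):
            warnings.append(f"conflicting volume directions: {group}")
            group = []
        if not text[m.end():].strip():
            warnings.append(f"steering tag [{m.group(1)}] has no text after it")
    return warnings


print(lint("[whisper in a hushed style] [very loud] Come here."))
print(lint("[say excitedly] We won! [laugh]"))
```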

Choosing between Inworld TTS models

  • Realtime TTS 2.0 (inworld/realtime-tts-2) — Most expressive, with natural-language steering and stronger multilingual support. Best when you need fine-grained control over delivery.
  • Realtime TTS 1.5 Max (inworld/realtime-tts-1.5-max) — #1-ranked on Artificial Analysis with sub-200ms latency. Best for raw 1.5-era quality.
  • Realtime TTS 1.5 Mini (inworld/realtime-tts-1.5-mini) — Ultra-fast (~120ms), most cost-efficient. Best for high-volume, latency-sensitive applications.