google/gemini-3.1-flash-tts

Google's fast, expressive text-to-speech model with 30 voices and support for 70+ languages

Gemini 3.1 Flash TTS

Google’s Gemini 3.1 Flash TTS converts text into natural-sounding speech with fine-grained control over delivery. You describe how the text should be spoken using a style prompt, and inline tags let you adjust tone, pace, and emotion mid-sentence. It supports 30 voices and 70+ languages.

The model is optimized for low latency, which makes it a good fit for real-time applications. It also suits audiobooks, voiceovers, podcasts, and accessibility tools.

Inputs

| Input | Description | Default |
| --- | --- | --- |
| text | The text to convert to speech. Supports inline [tags]. Max 4,000 bytes. | (required) |
| voice | Which voice to use. See the full list below. | Kore |
| prompt | Style instructions: tone, pace, accent, emotion, character context. Max 4,000 bytes. | Say the following. |
| language_code | BCP-47 language code for the output. | en-US |

Together, text and prompt can’t exceed 8,000 bytes. Output audio is capped at roughly 655 seconds (just under 11 minutes).
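Because the limits are stated in bytes rather than characters, accented or non-Latin text can hit them sooner than a character count suggests. The following is a minimal sketch of a pre-flight check, assuming sizes are measured in UTF-8 (the README does not name the encoding). The function names `byte_len` and `check_limits` are illustrative, not part of any API.

```python
TEXT_LIMIT = 4000      # max bytes for `text`
PROMPT_LIMIT = 4000    # max bytes for `prompt`
COMBINED_LIMIT = 8000  # max bytes for text + prompt together


def byte_len(s: str) -> int:
    """Size of a string in bytes, assuming UTF-8 encoding (an assumption)."""
    return len(s.encode("utf-8"))


def check_limits(text: str, prompt: str = "Say the following.") -> list:
    """Return a list of problems; an empty list means the inputs fit."""
    problems = []
    if not text:
        problems.append("text is required")
    if byte_len(text) > TEXT_LIMIT:
        problems.append(f"text is {byte_len(text)} bytes (limit {TEXT_LIMIT})")
    if byte_len(prompt) > PROMPT_LIMIT:
        problems.append(f"prompt is {byte_len(prompt)} bytes (limit {PROMPT_LIMIT})")
    if byte_len(text) + byte_len(prompt) > COMBINED_LIMIT:
        problems.append(f"text + prompt exceed {COMBINED_LIMIT} bytes")
    return problems


# 2,001 copies of "é" are only 2,001 characters but 4,002 bytes in UTF-8.
print(check_limits("é" * 2001))
```

Running a check like this before each request avoids spending a call on input that would be rejected for size.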

Style prompting

The prompt input sets the overall context and delivery direction. The more specific you are, the better the results. A good prompt can include:

  • Who is speaking — A radio DJ, a documentary narrator, a children’s book reader. Giving the speaker a name and identity helps ground the performance.
  • The scene — Where are they? What’s happening? Environmental context subtly shapes delivery.
  • Director’s notes — Style, accent, and pacing instructions. “Infectious enthusiasm, the listener should feel part of a community event” works better than “energetic.”

Example

prompt:

AUDIO PROFILE: Jaz R., "The Morning Hype"

THE SCENE: A glass-walled studio overlooking the London skyline.
The red ON AIR tally light is blazing. Jaz is bouncing on the balls
of their heels to the rhythm of a thumping backing track.

DIRECTOR'S NOTES:
- Style: You must hear the grin in the audio. Bright, sunny, inviting.
- Accent: Jaz is from Brixton, London.
- Pace: Energetic bouncing cadence, high-speed delivery, no dead air.

text:

[excitedly] Yes, massive vibes in the studio! You are locked in and
it is absolutely popping off in London right now. [shouting] Turn
this up! We've got the project roadmap landing in three, two... let's go!
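The example prompt above follows a profile / scene / notes layout. As a rough sketch, that layout could be assembled in code and paired with the text in a single input dictionary. The helper name `build_style_prompt` and the choice of the Puck voice are illustrative assumptions. How the dictionary is sent to the model depends on the client you use and is not shown here.

```python
def build_style_prompt(name: str, show: str, scene: str, notes: dict) -> str:
    """Assemble a style prompt in the profile / scene / notes layout."""
    lines = [
        f'AUDIO PROFILE: {name}, "{show}"',
        "",
        f"THE SCENE: {scene}",
        "",
        "DIRECTOR'S NOTES:",
    ]
    # One "- Label: value" line per director's note, in insertion order.
    lines += [f"- {label}: {value}" for label, value in notes.items()]
    return "\n".join(lines)


prompt = build_style_prompt(
    name="Jaz R.",
    show="The Morning Hype",
    scene="A glass-walled studio overlooking the London skyline.",
    notes={
        "Style": "You must hear the grin in the audio. Bright, sunny, inviting.",
        "Accent": "Jaz is from Brixton, London.",
        "Pace": "Energetic bouncing cadence, high-speed delivery, no dead air.",
    },
)

# Keys match the Inputs table; language_code is omitted so the default applies.
model_input = {
    "text": "[excitedly] Yes, massive vibes in the studio! [shouting] Turn this up!",
    "voice": "Puck",  # arbitrary pick for illustration
    "prompt": prompt,
}

print(prompt)
```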

Inline tags

Tags are bracketed modifiers placed directly in your text. They give you targeted control over specific parts of the speech.

[laughing] That's hilarious! [whispering] But don't tell anyone.

Non-speech sounds

These insert an audible vocalization without speaking the tag itself.

| Tag | What it does |
| --- | --- |
| [sigh] | Inserts a sigh |
| [laughing] | Inserts a laugh |
| [uhm] | Inserts a hesitation sound |

Style modifiers

These change the delivery of the text that follows.

| Tag | What it does |
| --- | --- |
| [whispering] | Quiet, whispered delivery |
| [shouting] | Loud, projected delivery |
| [sarcasm] | Sarcastic tone |
| [robotic] | Robotic-sounding delivery |
| [extremely fast] | Speeds up the speech |

Pauses

These insert silence for rhythm and pacing control.

| Tag | Duration |
| --- | --- |
| [short pause] | ~250ms (like a comma) |
| [medium pause] | ~500ms (like a sentence break) |
| [long pause] | ~1000ms+ (dramatic effect) |
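When budgeting against the output length cap, it can help to estimate how much silence the pause tags add. The sketch below sums the approximate durations listed above. The figures are the README's rough values, and the result is a lower bound, since a long pause may run past 1000ms. The name `estimate_pause_ms` is hypothetical.

```python
import re

# Approximate durations in milliseconds, taken from the pause table.
PAUSE_MS = {"short pause": 250, "medium pause": 500, "long pause": 1000}


def estimate_pause_ms(text: str) -> int:
    """Sum the approximate silence added by pause tags (a lower bound)."""
    tags = re.findall(r"\[([^\[\]]+)\]", text)
    # Tags that are not pause tags contribute nothing to the total.
    return sum(PAUSE_MS.get(tag.strip().lower(), 0) for tag in tags)


line = "And the winner is [long pause] not who you expected. [short pause] Surprised?"
print(estimate_pause_ms(line))  # 1250
```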

You can also try any descriptive tag, such as [excitedly], [bored], [reluctantly], [singing], or [asmr], and the model will attempt to interpret it. Test new tags before relying on them in production, since some may be spoken aloud rather than applied as modifiers.
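One way to follow the "test new tags" advice is to flag every tag in a script that does not appear in the three tables above, so those can be auditioned before release. This is a minimal sketch. The set `DOCUMENTED_TAGS` is transcribed from this README, and the helper names are illustrative.

```python
import re

# Tags listed in the non-speech, style-modifier, and pause tables.
DOCUMENTED_TAGS = {
    "sigh", "laughing", "uhm",
    "whispering", "shouting", "sarcasm", "robotic", "extremely fast",
    "short pause", "medium pause", "long pause",
}


def find_tags(text: str) -> list:
    """Return every bracketed tag in the text, lowercased, in order."""
    return [m.strip().lower() for m in re.findall(r"\[([^\[\]]+)\]", text)]


def untested_tags(text: str) -> list:
    """Return tags not in the documented tables, deduplicated, in first-seen order."""
    seen = []
    for tag in find_tags(text):
        if tag not in DOCUMENTED_TAGS and tag not in seen:
            seen.append(tag)
    return seen


sample = (
    "[excitedly] We won! [laughing] I can't believe it. "
    "[short pause] [reluctantly] Fine, you were right."
)
print(untested_tags(sample))  # ['excitedly', 'reluctantly']
```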

Voices

30 prebuilt voices are available, each with a distinct character:

| Voice | Gender | Character |
| --- | --- | --- |
| Zephyr | Female | Bright |
| Puck | Male | Upbeat |
| Charon | Male | Informative |
| Kore | Female | Firm |
| Fenrir | Male | Excitable |
| Leda | Female | Youthful |
| Orus | Male | Firm |
| Aoede | Female | Breezy |
| Callirrhoe | Female | Easy-going |
| Autonoe | Female | Bright |
| Enceladus | Male | Breathy |
| Iapetus | Male | Clear |
| Umbriel | Male | Easy-going |
| Algenib | Male | Gravelly |
| Despina | Female | Smooth |
| Erinome | Female | Clear |
| Laomedeia | Female | Upbeat |
| Achernar | Female | Soft |
| Algieba | Male | Smooth |
| Schedar | Male | Even |
| Gacrux | Female | Mature |
| Pulcherrima | Female | Forward |
| Achird | Male | Friendly |
| Zubenelgenubi | Male | Casual |
| Vindemiatrix | Female | Gentle |
| Sadachbia | Male | Lively |
| Sadaltager | Male | Knowledgeable |
| Sulafat | Female | Warm |
| Alnilam | Male | Firm |
| Rasalgethi | Male | Informative |
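If you choose voices in code, the list above can be kept as a small lookup and filtered by gender or character. The sketch below transcribes the voice list as given here. `VOICES` and `voices_matching` are illustrative names, not part of the model's interface.

```python
# Voice name -> (gender, character), transcribed from the voice list.
VOICES = {
    "Zephyr": ("Female", "Bright"), "Puck": ("Male", "Upbeat"),
    "Charon": ("Male", "Informative"), "Kore": ("Female", "Firm"),
    "Fenrir": ("Male", "Excitable"), "Leda": ("Female", "Youthful"),
    "Orus": ("Male", "Firm"), "Aoede": ("Female", "Breezy"),
    "Callirrhoe": ("Female", "Easy-going"), "Autonoe": ("Female", "Bright"),
    "Enceladus": ("Male", "Breathy"), "Iapetus": ("Male", "Clear"),
    "Umbriel": ("Male", "Easy-going"), "Algenib": ("Male", "Gravelly"),
    "Despina": ("Female", "Smooth"), "Erinome": ("Female", "Clear"),
    "Laomedeia": ("Female", "Upbeat"), "Achernar": ("Female", "Soft"),
    "Algieba": ("Male", "Smooth"), "Schedar": ("Male", "Even"),
    "Gacrux": ("Female", "Mature"), "Pulcherrima": ("Female", "Forward"),
    "Achird": ("Male", "Friendly"), "Zubenelgenubi": ("Male", "Casual"),
    "Vindemiatrix": ("Female", "Gentle"), "Sadachbia": ("Male", "Lively"),
    "Sadaltager": ("Male", "Knowledgeable"), "Sulafat": ("Female", "Warm"),
    "Alnilam": ("Male", "Firm"), "Rasalgethi": ("Male", "Informative"),
}


def voices_matching(gender=None, character=None) -> list:
    """Return voice names matching the given filters; None means 'any'."""
    return [
        name
        for name, (g, c) in VOICES.items()
        if (gender is None or g == gender) and (character is None or c == character)
    ]


print(voices_matching(character="Firm"))                   # ['Kore', 'Orus', 'Alnilam']
print(voices_matching(gender="Female", character="Bright"))  # ['Zephyr', 'Autonoe']
```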

Languages

The model auto-detects input language, but you can set language_code explicitly for best results. Supported languages include English, French, German, Spanish, Portuguese, Italian, Dutch, Japanese, Korean, Hindi, Arabic, Russian, Polish, Romanian, Turkish, Thai, Vietnamese, Indonesian, and 50+ more in preview.

See Google’s full language list for BCP-47 codes.

Tips

  • Align everything. The style prompt, the text content, and any tags should all point in the same direction. A scared-sounding prompt works best with text that actually sounds alarming.
  • Don’t overspecify. The model fills in gaps naturally. Leaving some room often produces more natural results than controlling every detail.
  • Use rich text. Emotionally evocative text gives the model more to work with. “I just heard a window break” produces a more genuinely scared result than “Something happened.”
  • Tags for precision, prompts for tone. Use tags when you need a specific moment (a laugh, a pause, a whisper). Use the prompt to set the overall feel.
  • Punctuation matters. Commas, periods, and semicolons create natural pauses. Use them to help the model breathe.
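For long material such as audiobook chapters, the text will usually need to be split across several requests to stay under the 4,000-byte text limit. Since punctuation shapes pacing, splitting at sentence boundaries is a reasonable default. The sketch below is one possible approach, assuming UTF-8 byte counting and a simple sentence-ending pattern. `split_for_tts` is a hypothetical helper.

```python
import re


def split_for_tts(text: str, max_bytes: int = 4000) -> list:
    """Greedily pack whole sentences into chunks of at most max_bytes (UTF-8)."""
    # Split after ., !, or ? when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if len(candidate.encode("utf-8")) <= max_bytes:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single sentence longer than max_bytes is kept whole here
            # and would need separate handling before sending.
            current = sentence
    if current:
        chunks.append(current)
    return chunks


print(split_for_tts("One. Two. Three.", max_bytes=9))  # ['One. Two.', 'Three.']
```

Each chunk can then be sent as its own request with the same voice and prompt, which helps keep delivery consistent across the pieces.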