
Gemini 3.1 Flash TTS

Google’s Gemini 3.1 Flash TTS converts text into natural-sounding speech with fine-grained control over delivery. You describe how the text should be spoken using a style prompt, and inline tags let you adjust tone, pace, and emotion mid-sentence. It supports 30 voices and 70+ languages.

The model is optimized for low latency, making it a good fit for real-time applications as well as audiobooks, voiceovers, podcasts, and accessibility tools.

Inputs

Input	Description	Default
`text`	The text to convert to speech. Supports inline `[tags]`. Max 4,000 bytes.	(required)
`voice`	Which voice to use. See the full list below.	`Kore`
`prompt`	Style instructions — tone, pace, accent, emotion, character context. Max 4,000 bytes.	`Say the following.`
`language_code`	BCP-47 language code for the output.	`en-US`

The combined size of `text` and `prompt` can’t exceed 8,000 bytes. Output audio is capped at roughly 655 seconds (just under 11 minutes).
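As a rough sketch, the size limits above can be checked on the client before a request is sent. The function name `check_tts_input` is hypothetical, and measuring size in UTF-8 bytes is an assumption about how the service counts them.

```python
# Limits taken from the inputs table above.
MAX_TEXT_BYTES = 4000
MAX_PROMPT_BYTES = 4000
MAX_COMBINED_BYTES = 8000
DEFAULT_PROMPT = "Say the following."


def check_tts_input(text: str, prompt: str = DEFAULT_PROMPT) -> list[str]:
    """Return a list of problems; an empty list means the input fits the limits.

    Assumption: sizes are counted in UTF-8 bytes, so non-ASCII characters
    cost more than one byte each.
    """
    problems = []
    text_bytes = len(text.encode("utf-8"))
    prompt_bytes = len(prompt.encode("utf-8"))

    if not text:
        problems.append("text is required")
    if text_bytes > MAX_TEXT_BYTES:
        problems.append(f"text is {text_bytes} bytes (max {MAX_TEXT_BYTES})")
    if prompt_bytes > MAX_PROMPT_BYTES:
        problems.append(f"prompt is {prompt_bytes} bytes (max {MAX_PROMPT_BYTES})")
    if text_bytes + prompt_bytes > MAX_COMBINED_BYTES:
        problems.append(
            f"text + prompt is {text_bytes + prompt_bytes} bytes "
            f"(max {MAX_COMBINED_BYTES})"
        )
    return problems
```

Counting bytes rather than characters matters for non-English text: a 2,001-character string of `é` is 4,002 bytes and would be rejected even though it looks short.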

Style prompting

The `prompt` input sets the overall context and delivery direction. The more specific you are, the better the results. A good prompt can include:

Who is speaking — A radio DJ, a documentary narrator, a children’s book reader. Giving the speaker a name and identity helps ground the performance.
The scene — Where are they? What’s happening? Environmental context subtly shapes delivery.
Director’s notes — Style, accent, and pacing instructions. “Infectious enthusiasm, the listener should feel part of a community event” works better than “energetic.”

Example

prompt:

AUDIO PROFILE: Jaz R., "The Morning Hype"

THE SCENE: A glass-walled studio overlooking the London skyline.
The red ON AIR tally light is blazing. Jaz is bouncing on the balls
of their heels to the rhythm of a thumping backing track.

DIRECTOR'S NOTES:
- Style: You must hear the grin in the audio. Bright, sunny, inviting.
- Accent: Jaz is from Brixton, London.
- Pace: Energetic bouncing cadence, high-speed delivery, no dead air.

text:

[excitedly] Yes, massive vibes in the studio! You are locked in and
it is absolutely popping off in London right now. [shouting] Turn
this up! We've got the project roadmap landing in three, two... let's go!
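One way to keep prompts like this consistent is to assemble them from their three parts. The sketch below is illustrative only: `build_style_prompt` is a hypothetical helper, the section labels simply mirror the example above, and the choice of the `Puck` voice is an assumption made for the demonstration.

```python
def build_style_prompt(speaker: str, scene: str, notes: dict[str, str]) -> str:
    """Assemble a style prompt from a speaker, a scene, and director's notes."""
    lines = [
        f"AUDIO PROFILE: {speaker}",
        "",
        f"THE SCENE: {scene}",
        "",
        "DIRECTOR'S NOTES:",
    ]
    # One "- Label: instruction" line per note, in insertion order.
    lines += [f"- {label}: {value}" for label, value in notes.items()]
    return "\n".join(lines)


prompt = build_style_prompt(
    speaker='Jaz R., "The Morning Hype"',
    scene=(
        "A glass-walled studio overlooking the London skyline. "
        "The red ON AIR tally light is blazing."
    ),
    notes={
        "Style": "You must hear the grin in the audio. Bright, sunny, inviting.",
        "Accent": "Jaz is from Brixton, London.",
        "Pace": "Energetic bouncing cadence, high-speed delivery, no dead air.",
    },
)

# Input names match the inputs table; the voice is an illustrative choice.
inputs = {
    "text": "[excitedly] Yes, massive vibes in the studio! [shouting] Turn this up!",
    "voice": "Puck",
    "prompt": prompt,
}
```

How `inputs` is sent depends on the client you use to call the model, so that step is left out here.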

Inline tags

Tags are bracketed modifiers placed directly in your text. They give you targeted control over specific parts of the speech.

[laughing] That's hilarious! [whispering] But don't tell anyone.

Non-speech sounds

These insert an audible vocalization without speaking the tag itself.

Tag	What it does
`[sigh]`	Inserts a sigh
`[laughing]`	Inserts a laugh
`[uhm]`	Inserts a hesitation sound

Style modifiers

These change the delivery of the text that follows.

Tag	What it does
`[whispering]`	Quiet, whispered delivery
`[shouting]`	Loud, projected delivery
`[sarcasm]`	Sarcastic tone
`[robotic]`	Robotic-sounding delivery
`[extremely fast]`	Speeds up the speech

Pauses

These insert silence for rhythm and pacing control.

Tag	Duration
`[short pause]`	~250ms (like a comma)
`[medium pause]`	~500ms (like a sentence break)
`[long pause]`	~1000ms+ (dramatic effect)
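Because pauses add to the length of the output, it can help to estimate how much silence a script requests. This is a minimal sketch using the approximate durations from the table above; `estimated_pause_ms` is a hypothetical helper, and the result is a lower bound because `[long pause]` may run longer than 1000ms.

```python
import re

# Approximate durations from the table above, in milliseconds.
PAUSE_MS = {"short pause": 250, "medium pause": 500, "long pause": 1000}


def estimated_pause_ms(text: str) -> int:
    """Sum the approximate silence requested by pause tags in the text."""
    tags = re.findall(r"\[([^\[\]]+)\]", text)
    # Tags that are not pauses contribute nothing to the total.
    return sum(PAUSE_MS.get(tag.strip().lower(), 0) for tag in tags)
```

For example, a script containing one of each pause tag requests at least 1,750ms of silence.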

You can also use any descriptive tag, such as `[excitedly]`, `[bored]`, `[reluctantly]`, `[singing]`, or `[asmr]`, and the model will interpret it. Test new tags before relying on them in production, since some may be spoken aloud rather than used as modifiers.
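To find the tags that still need testing, a script can be scanned for any tag that is not in the tables above. The following is a sketch under that assumption: `DOCUMENTED_TAGS` contains only the tags listed in this README, and `find_tags` and `untested_tags` are hypothetical helper names.

```python
import re

# Tags listed in the tables above; anything else is a free-form descriptive tag.
DOCUMENTED_TAGS = {
    "sigh", "laughing", "uhm",
    "whispering", "shouting", "sarcasm", "robotic", "extremely fast",
    "short pause", "medium pause", "long pause",
}

TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def find_tags(text: str) -> list[str]:
    """Return every bracketed tag in order of appearance, lowercased."""
    return [tag.strip().lower() for tag in TAG_PATTERN.findall(text)]


def untested_tags(text: str) -> list[str]:
    """Return the distinct tags that are not in the documented tables."""
    seen = []
    for tag in find_tags(text):
        if tag not in DOCUMENTED_TAGS and tag not in seen:
            seen.append(tag)
    return seen
```

Running `untested_tags` over a script before release gives a short list of tags to listen for in a test render.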

Voices

30 prebuilt voices are available, each with a distinct character:

Voice	Gender	Character
Zephyr	Female	Bright
Puck	Male	Upbeat
Charon	Male	Informative
Kore	Female	Firm
Fenrir	Male	Excitable
Leda	Female	Youthful
Orus	Male	Firm
Aoede	Female	Breezy
Callirrhoe	Female	Easy-going
Autonoe	Female	Bright
Enceladus	Male	Breathy
Iapetus	Male	Clear
Umbriel	Male	Easy-going
Algenib	Male	Gravelly
Despina	Female	Smooth
Erinome	Female	Clear
Laomedeia	Female	Upbeat
Achernar	Female	Soft
Algieba	Male	Smooth
Schedar	Male	Even
Gacrux	Female	Mature
Pulcherrima	Female	Forward
Achird	Male	Friendly
Zubenelgenubi	Male	Casual
Vindemiatrix	Female	Gentle
Sadachbia	Male	Lively
Sadaltager	Male	Knowledgeable
Sulafat	Female	Warm
Alnilam	Male	Firm
Rasalgethi	Male	Informative
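When choosing a voice programmatically, the table above can be held as data and filtered. This is only a sketch: the rows are copied from the table, and `voices_matching` is a hypothetical helper, not part of any client library.

```python
from typing import Optional

# (name, gender, character) rows copied from the table above.
VOICES = [
    ("Zephyr", "Female", "Bright"), ("Puck", "Male", "Upbeat"),
    ("Charon", "Male", "Informative"), ("Kore", "Female", "Firm"),
    ("Fenrir", "Male", "Excitable"), ("Leda", "Female", "Youthful"),
    ("Orus", "Male", "Firm"), ("Aoede", "Female", "Breezy"),
    ("Callirrhoe", "Female", "Easy-going"), ("Autonoe", "Female", "Bright"),
    ("Enceladus", "Male", "Breathy"), ("Iapetus", "Male", "Clear"),
    ("Umbriel", "Male", "Easy-going"), ("Algenib", "Male", "Gravelly"),
    ("Despina", "Female", "Smooth"), ("Erinome", "Female", "Clear"),
    ("Laomedeia", "Female", "Upbeat"), ("Achernar", "Female", "Soft"),
    ("Algieba", "Male", "Smooth"), ("Schedar", "Male", "Even"),
    ("Gacrux", "Female", "Mature"), ("Pulcherrima", "Female", "Forward"),
    ("Achird", "Male", "Friendly"), ("Zubenelgenubi", "Male", "Casual"),
    ("Vindemiatrix", "Female", "Gentle"), ("Sadachbia", "Male", "Lively"),
    ("Sadaltager", "Male", "Knowledgeable"), ("Sulafat", "Female", "Warm"),
    ("Alnilam", "Male", "Firm"), ("Rasalgethi", "Male", "Informative"),
]


def voices_matching(
    gender: Optional[str] = None, character: Optional[str] = None
) -> list[str]:
    """Return voice names matching the given gender and/or character."""
    return [
        name
        for name, voice_gender, voice_character in VOICES
        if (gender is None or voice_gender == gender)
        and (character is None or voice_character == character)
    ]
```

For instance, `voices_matching(character="Firm")` returns `Kore`, `Orus`, and `Alnilam`, which is a quick way to shortlist voices for an audition.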

Languages

The model auto-detects the input language, but you can set `language_code` explicitly for best results. Supported languages include English, French, German, Spanish, Portuguese, Italian, Dutch, Japanese, Korean, Hindi, Arabic, Russian, Polish, Romanian, Turkish, Thai, Vietnamese, Indonesian, and 50+ more in preview.

See Google’s full language list for BCP-47 codes.

Tips

Align everything. The style prompt, the text content, and any tags should all point in the same direction. A scared-sounding prompt works best with text that actually sounds alarming.
Don’t overspecify. The model fills in gaps naturally. Leaving some room often produces more natural results than controlling every detail.
Use rich text. Emotionally evocative text gives the model more to work with. “I just heard a window break” produces a more genuinely scared result than “Something happened.”
Tags for precision, prompts for tone. Use tags when you need a specific moment (a laugh, a pause, a whisper). Use the prompt to set the overall feel.
Punctuation matters. Commas, periods, and semicolons create natural pauses. Use them to help the model breathe.
