Grok Text-to-Speech | xAI on Replicate

Grok Text-to-Speech is xAI’s voice synthesis model. Send it text, get back natural, expressive speech audio. Choose from five distinct voices, control delivery with inline speech tags, and pick the audio format that fits your use case — from high-fidelity MP3 to telephony-grade μ-law.

Highlights

5 voices — eve, ara, rex, sal, leo. Each has its own personality and tone.
20 languages with automatic language detection.
Speech tags — control pacing, emotion, and delivery with inline tags like [pause], [laugh], and wrapping tags like <whisper>...</whisper>.
Multiple output formats — MP3 (default), WAV, raw PCM, μ-law, A-law.
Configurable quality — sample rate from 8 kHz to 48 kHz, MP3 bit rates from 32 to 192 kbps.
Long inputs — up to 15,000 characters per request.
Text normalization — speak written-form numbers, currencies, and abbreviations the way you’d say them out loud.
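
Because each request accepts at most 15,000 characters, longer documents need to be split before synthesis. The following is a minimal sketch of one way to do that, preferring sentence boundaries. The helper name `split_for_tts` and the naive sentence-splitting regex are illustrative assumptions, not part of the model's API.

```python
import re

MAX_CHARS = 15_000  # per-request limit stated in the highlights above


def split_for_tts(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Split text into chunks of at most max_chars, preferring sentence boundaries."""
    # Naive sentence split: break after ., !, or ? when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Hard-wrap any single sentence that is itself longer than the limit.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # The sentence does not fit: close the current chunk and start a new one.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent as its own request and the resulting audio files joined afterwards. Note that this sketch is unaware of speech tags, so a wrapping tag that spans a chunk boundary would be split; keep tagged spans short or split the text manually around them.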

Quick start

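import replicate

# The Replicate client reads your API token from the REPLICATE_API_TOKEN
# environment variable.
output = replicate.run(
    "xai/grok-text-to-speech",
    input={
        "text": "Hello! Welcome to the xAI text to speech API.",
        "voice": "eve",    # one of: eve, ara, rex, sal, leo
        "language": "en",  # the language is auto-detected by default
    },
)

# The default output format is MP3, so write the audio bytes to an .mp3 file.
with open("hello.mp3", "wb") as f:
    f.write(output.read())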

Voices

Voice	Tone	Description
`eve`	Energetic, upbeat	Default voice — engaging and enthusiastic.
`ara`	Warm, friendly	Balanced and conversational.
`rex`	Confident, clear	Professional and articulate.
`sal`	Smooth, balanced	Versatile across content types.
`leo`	Authoritative, strong	Commanding — great for instructional content.

Speech tags

Add speech tags to your text to control delivery. There are two kinds:

Inline tags — drop in at a specific point: [pause], [long-pause], [laugh], [sigh], [breath].
Wrapping tags — change how a span of text is delivered: <whisper>...</whisper>, <slow>...</slow>, <soft>...</soft>.

Example:

So I walked in and [pause] there it was. [laugh] I could not believe it.

I need to tell you something. <whisper>It is a secret.</whisper> Pretty cool, right?
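
A malformed tag is easy to miss in long text, so it can help to check the input before sending it. The sketch below assumes the tags listed above are the complete set, which this page does not confirm, and that wrapping tags must be closed in the order they were opened. The function name `check_speech_tags` is hypothetical.

```python
import re

# Tag names taken from the lists above; assumed (not confirmed) to be exhaustive.
INLINE_TAGS = {"pause", "long-pause", "laugh", "sigh", "breath"}
WRAPPING_TAGS = {"whisper", "slow", "soft"}


def check_speech_tags(text: str) -> list[str]:
    """Return a list of problems found in the speech tags of text (empty if none)."""
    problems: list[str] = []

    # Inline tags: every [name] should be one of the known inline tags.
    for name in re.findall(r"\[([^\[\]]+)\]", text):
        if name not in INLINE_TAGS:
            problems.append(f"unknown inline tag: [{name}]")

    # Wrapping tags: every <name> needs a matching </name>, properly nested.
    stack: list[str] = []
    for closing, name in re.findall(r"<(/?)([a-z-]+)>", text):
        if name not in WRAPPING_TAGS:
            problems.append(f"unknown wrapping tag: <{closing}{name}>")
        elif not closing:
            stack.append(name)
        elif stack and stack[-1] == name:
            stack.pop()
        else:
            problems.append(f"unexpected closing tag: </{name}>")
    problems.extend(f"unclosed tag: <{name}>" for name in stack)
    return problems
```

Both example lines above pass this check with no problems reported. Any other bracketed text, such as a citation marker like [1], would be flagged as an unknown inline tag, so treat the result as a warning rather than a hard error.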

Output formats

Pick the codec and sample rate that match your use case:

Codec	Best for
`mp3`	General use — wide compatibility, good compression.
`wav`	Lossless audio — editing, post-production.
`pcm`	Raw audio — real-time processing pipelines.
`mulaw`	Telephony (G.711 μ-law).
`alaw`	Telephony (G.711 A-law).

Defaults are MP3 at 24 kHz / 128 kbps. For studio-grade audio, use WAV, or MP3 at 192 kbps, with a sample rate of 44.1 kHz or 48 kHz (the bit rate setting applies to MP3 only). For telephony, use mulaw or alaw at 8 kHz.
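
The recommendations above can be captured as named presets. This is only a sketch: the keys `codec`, `sample_rate`, and `bit_rate` are assumed names chosen for illustration, and this page does not state the actual input field names, so check the model's input schema before using them.

```python
# NOTE: "codec", "sample_rate", and "bit_rate" are hypothetical field names.
# Sample rates are in Hz; bit rates are in kbps and apply to MP3 only.
PRESETS = {
    "default": {"codec": "mp3", "sample_rate": 24_000, "bit_rate": 128},
    "studio_mp3": {"codec": "mp3", "sample_rate": 48_000, "bit_rate": 192},
    "studio_wav": {"codec": "wav", "sample_rate": 48_000},
    "telephony_mulaw": {"codec": "mulaw", "sample_rate": 8_000},
    "telephony_alaw": {"codec": "alaw", "sample_rate": 8_000},
}


def output_settings(preset: str) -> dict:
    """Return a copy of the settings for a named preset."""
    if preset not in PRESETS:
        raise ValueError(f"unknown preset {preset!r}; choose from {sorted(PRESETS)}")
    # Copy so callers can modify the result without changing the preset table.
    return dict(PRESETS[preset])
```

Once the real field names are confirmed, the returned dictionary could be merged into the `input` passed to `replicate.run`, for example `{"text": "...", **output_settings("telephony_mulaw")}`.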

Supported languages

Arabic (Egypt, Saudi Arabia, UAE), Bengali, Chinese (Simplified), English, French, German, Hindi, Indonesian, Italian, Japanese, Korean, Portuguese (Brazil, Portugal), Russian, Spanish (Mexico, Spain), Turkish, Vietnamese.

The model auto-detects the language by default. Set language to a specific code, such as en, to force that language or to get more consistent results.
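
One way to handle both cases is a small helper that includes `language` only when a code is given. This sketch uses the `text`, `voice`, and `language` fields from the quick start; the helper name `build_input` is hypothetical, and it assumes that leaving `language` out of the input falls back to the default auto-detection.

```python
from typing import Optional

VOICES = {"eve", "ara", "rex", "sal", "leo"}
MAX_CHARS = 15_000  # per-request limit from the highlights


def build_input(text: str, voice: str = "eve", language: Optional[str] = None) -> dict:
    """Build the input dictionary for a text-to-speech request."""
    if voice not in VOICES:
        raise ValueError(f"unknown voice {voice!r}; choose from {sorted(VOICES)}")
    if len(text) > MAX_CHARS:
        raise ValueError(f"text is {len(text)} characters; the limit is {MAX_CHARS}")
    payload = {"text": text, "voice": voice}
    # Assumption: omitting "language" lets the model auto-detect it.
    if language is not None:
        payload["language"] = language
    return payload
```

The result can be passed directly as `input=build_input("Hello!", voice="leo", language="en")` in the `replicate.run` call shown in the quick start.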
