xai/grok-text-to-speech

Convert text to natural-sounding speech with xAI's Grok TTS. 5 voices, 20 languages, expressive speech tags, and high-fidelity MP3 / WAV / telephony audio output.


Grok Text-to-Speech is xAI’s voice synthesis model. Send it text, get back natural, expressive speech audio. Choose from five distinct voices, control delivery with inline speech tags, and pick the audio format that fits your use case — from high-fidelity MP3 to telephony-grade μ-law.

Highlights

  • 5 voices — eve, ara, rex, sal, leo. Each has its own personality and tone.
  • 20 languages with automatic language detection.
  • Speech tags — control pacing, emotion, and delivery with inline tags like [pause], [laugh], and wrapping tags like <whisper>...</whisper>.
  • Multiple output formats — MP3 (default), WAV, raw PCM, μ-law, A-law.
  • Configurable quality — sample rate from 8 kHz to 48 kHz, MP3 bit rates from 32 to 192 kbps.
  • Long inputs — up to 15,000 characters per request.
  • Text normalization — speak written-form numbers, currencies, and abbreviations the way you’d say them out loud.
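
For inputs longer than the 15,000-character limit, one option is to split the text on sentence boundaries and send each piece as its own request. The helper below is a minimal sketch of that idea; `split_text` and `MAX_CHARS` are illustrative names, not part of the model's API, and the sentence-splitting regex is a simple heuristic.

```python
import re

MAX_CHARS = 15_000  # per-request limit listed in the highlights


def split_text(text: str, limit: int = MAX_CHARS) -> list[str]:
    """Split text into chunks of at most `limit` characters,
    preferring sentence boundaries (., !, ? followed by whitespace)."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # A single sentence longer than the limit is hard-split.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip() if current else sentence
        if len(candidate) <= limit:
            current = candidate
        else:
            # Adding this sentence would overflow, so start a new chunk.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed as the `text` input of a separate request. Note that this sketch does not know about speech tags, so a wrapping tag that spans a chunk boundary would need to be closed and reopened by hand.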

Quick start

import replicate

output = replicate.run(
    "xai/grok-text-to-speech",
    input={
        "text": "Hello! Welcome to the xAI text to speech API.",
        "voice": "eve",
        "language": "en",
    },
)

with open("hello.mp3", "wb") as f:
    f.write(output.read())

Voices

Voice  Tone                   Description
eve    Energetic, upbeat      Default voice; engaging and enthusiastic.
ara    Warm, friendly         Balanced and conversational.
rex    Confident, clear       Professional and articulate.
sal    Smooth, balanced       Versatile across content types.
leo    Authoritative, strong  Commanding; great for instructional content.

Speech tags

Add speech tags to the input text to control delivery:

  • Inline tags — drop in at a specific point: [pause], [long-pause], [laugh], [sigh], [breath].
  • Wrapping tags — change how a span of text is delivered: <whisper>...</whisper>, <slow>...</slow>, <soft>...</soft>.

Example:

So I walked in and [pause] there it was. [laugh] I could not believe it.
I need to tell you something. <whisper>It is a secret.</whisper> Pretty cool, right?
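
When scripts are assembled programmatically, small helpers can keep tags well formed. The sketch below only builds strings; `inline`, `wrap`, and the two tag sets are illustrative names, and the sets contain only the tags listed above, which may not be the model's full tag vocabulary.

```python
# Tags named in this README; the model may accept others.
INLINE_TAGS = {"pause", "long-pause", "laugh", "sigh", "breath"}
WRAPPING_TAGS = {"whisper", "slow", "soft"}


def inline(tag: str) -> str:
    """Return an inline tag such as [pause]."""
    if tag not in INLINE_TAGS:
        raise ValueError(f"unknown inline tag: {tag}")
    return f"[{tag}]"


def wrap(tag: str, text: str) -> str:
    """Wrap a span of text, e.g. <whisper>...</whisper>."""
    if tag not in WRAPPING_TAGS:
        raise ValueError(f"unknown wrapping tag: {tag}")
    return f"<{tag}>{text}</{tag}>"


script = " ".join([
    "I need to tell you something.",
    wrap("whisper", "It is a secret."),
    inline("pause"),
    "Pretty cool, right?",
])
```

The resulting `script` string can be passed as the `text` input in the quick start call.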

Output formats

Pick the codec and sample rate that match your use case:

Codec  Best for
mp3    General use: wide compatibility, good compression.
wav    Lossless audio: editing, post-production.
pcm    Raw audio: real-time processing pipelines.
mulaw  Telephony (G.711 μ-law).
alaw   Telephony (G.711 A-law).

Defaults are MP3 at 24 kHz / 128 kbps. For studio-grade audio, use WAV or MP3 at 44.1 kHz or 48 kHz, with the MP3 bit rate set to 192 kbps (bit rate applies only to MP3). For telephony, use mulaw or alaw at 8 kHz.
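
The recommendations above can be captured as presets and checked against the documented ranges before a request is sent. This is a sketch under assumptions: `AudioSettings`, `PRESETS`, and `validate` are hypothetical names, and the field names are not the model's input parameter names, which this README does not list. Check the model's input schema before mapping these values onto a request.

```python
from dataclasses import dataclass
from typing import Optional

CODECS = {"mp3", "wav", "pcm", "mulaw", "alaw"}


@dataclass(frozen=True)
class AudioSettings:
    # Hypothetical container; field names are NOT the model's input names.
    codec: str
    sample_rate_hz: int
    bit_rate_kbps: Optional[int] = None  # only meaningful for MP3


PRESETS = {
    "default": AudioSettings("mp3", 24_000, 128),
    "studio_mp3": AudioSettings("mp3", 48_000, 192),
    "studio_wav": AudioSettings("wav", 48_000),
    "telephony_mulaw": AudioSettings("mulaw", 8_000),
    "telephony_alaw": AudioSettings("alaw", 8_000),
}


def validate(settings: AudioSettings) -> None:
    """Raise ValueError if settings fall outside the ranges in this README."""
    if settings.codec not in CODECS:
        raise ValueError(f"unsupported codec: {settings.codec}")
    if not 8_000 <= settings.sample_rate_hz <= 48_000:
        raise ValueError("sample rate must be between 8 kHz and 48 kHz")
    if settings.codec == "mp3" and settings.bit_rate_kbps is not None:
        if not 32 <= settings.bit_rate_kbps <= 192:
            raise ValueError("MP3 bit rate must be between 32 and 192 kbps")
```

The range check treats the sample rate as continuous; the model may accept only specific values within that range.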

Supported languages

Arabic (Egypt, Saudi Arabia, UAE), Bengali, Chinese (Simplified), English, French, German, Hindi, Indonesian, Italian, Japanese, Korean, Portuguese (Brazil, Portugal), Russian, Spanish (Mexico, Spain), Turkish, Vietnamese.

The model auto-detects the language by default. Set language to a specific code (for example, en) to force a language, or to get more consistent results on short or mixed-language text.

Pricing

Billing is per character of input text, at xAI’s rate of $4.20 per 1,000,000 characters, passed through 1:1.

  • A 100-character message costs $0.00042.
  • A 1,000-character paragraph costs $0.0042.
  • A 15,000-character maximum-length request costs $0.063.
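
The same arithmetic can be sketched in code to estimate a request's cost up front. `estimate_cost` is an illustrative helper, not part of any API; it uses `Decimal` to avoid floating-point rounding and takes a plain character count, since this README does not say whether speech tags are billed as characters.

```python
from decimal import Decimal

PRICE_PER_MILLION_CHARS = Decimal("4.20")  # USD, from the pricing above


def estimate_cost(num_chars: int) -> Decimal:
    """Estimated cost in USD for a request with `num_chars` input characters."""
    return PRICE_PER_MILLION_CHARS * num_chars / Decimal(1_000_000)


text = "Hello! Welcome to the xAI text to speech API."
cost = estimate_cost(len(text))
```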