Grok Text-to-Speech is xAI’s voice synthesis model. Send it text, get back natural, expressive speech audio. Choose from five distinct voices, control delivery with inline speech tags, and pick the audio format that fits your use case — from high-fidelity MP3 to telephony-grade μ-law.
Highlights
- 5 voices — eve, ara, rex, sal, leo. Each has its own personality and tone.
- 20 languages with automatic language detection.
- Speech tags — control pacing, emotion, and delivery with inline tags like [pause], [laugh], and wrapping tags like <whisper>...</whisper>.
- Multiple output formats — MP3 (default), WAV, raw PCM, μ-law, A-law.
- Configurable quality — sample rate from 8 kHz to 48 kHz, MP3 bit rates from 32 to 192 kbps.
- Long inputs — up to 15,000 characters per request.
- Text normalization — speak written-form numbers, currencies, and abbreviations the way you’d say them out loud.
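Because a request accepts at most 15,000 characters, longer documents have to be split across several requests. The following is a minimal sketch of one way to do that, assuming it is acceptable to break at sentence boundaries and join the audio afterwards. `split_for_tts` and `MAX_CHARS` are illustrative names, not part of the API.

```python
import re

MAX_CHARS = 15_000  # per-request limit stated in the highlights


def split_for_tts(text: str, limit: int = MAX_CHARS) -> list[str]:
    """Split text into chunks of at most `limit` characters.

    Prefers sentence boundaries; a single sentence longer than `limit`
    is hard-split as a fallback.
    """
    # Split after '.', '!' or '?' when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Hard-split any sentence that cannot fit in one request.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent as its own request. Note that the hard-split fallback is unaware of speech tags, so a very long sentence containing a wrapping tag could be cut between its opening and closing tag.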
Quick start
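import replicate

# Synthesize speech with the "eve" voice; "en" forces English.
output = replicate.run(
    "xai/grok-text-to-speech",
    input={
        "text": "Hello! Welcome to the xAI text to speech API.",
        "voice": "eve",
        "language": "en",
    },
)

# Save the returned audio (MP3 by default) to disk.
with open("hello.mp3", "wb") as f:
    f.write(output.read())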
Voices
| Voice | Tone | Description |
|---|---|---|
| eve | Energetic, upbeat | Default voice — engaging and enthusiastic. |
| ara | Warm, friendly | Balanced and conversational. |
| rex | Confident, clear | Professional and articulate. |
| sal | Smooth, balanced | Versatile across content types. |
| leo | Authoritative, strong | Commanding — great for instructional content. |
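To compare the voices on your own text, one option is to render the same sentence once per voice. The sketch below reuses the `replicate.run` call and input fields from the quick start. The helper names `audition_inputs` and `render_auditions` are made up for illustration, and each call is billed as a separate request.

```python
VOICES = ("eve", "ara", "rex", "sal", "leo")


def audition_inputs(text: str, language: str = "en") -> dict[str, dict]:
    """Return one input payload per voice, keyed by voice name."""
    return {
        voice: {"text": text, "voice": voice, "language": language}
        for voice in VOICES
    }


def render_auditions(text: str) -> None:
    """Synthesize `text` once per voice and save it as <voice>.mp3."""
    # Imported lazily so audition_inputs() has no third-party dependency.
    import replicate

    for voice, payload in audition_inputs(text).items():
        output = replicate.run("xai/grok-text-to-speech", input=payload)
        with open(f"{voice}.mp3", "wb") as f:
            f.write(output.read())
```

Calling `render_auditions("The quick brown fox.")` would write five files, `eve.mp3` through `leo.mp3`, to the current directory.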
Speech tags
Add tags to the text to control delivery:
- Inline tags — drop in at a specific point: [pause], [long-pause], [laugh], [sigh], [breath].
- Wrapping tags — change how a span of text is delivered: <whisper>...</whisper>, <slow>...</slow>, <soft>...</soft>.
Example:
So I walked in and [pause] there it was. [laugh] I could not believe it.
I need to tell you something. <whisper>It is a secret.</whisper> Pretty cool, right?
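A misspelled or unclosed tag is easy to miss in long scripts, so it can help to check the text before sending it. Here is a minimal sketch of such a check. It assumes the tags named in this section are the complete set, which this README does not confirm, and `check_tags` is a hypothetical helper rather than part of the API.

```python
import re

# Only the tags named in this README; the model may accept others.
INLINE_TAGS = {"pause", "long-pause", "laugh", "sigh", "breath"}
WRAPPING_TAGS = {"whisper", "slow", "soft"}


def check_tags(text: str) -> list[str]:
    """Return a list of human-readable problems found in the speech tags."""
    problems = []
    # Inline tags look like [name].
    for name in re.findall(r"\[([a-z-]+)\]", text):
        if name not in INLINE_TAGS:
            problems.append(f"unknown inline tag [{name}]")
    # Wrapping tags look like <name>...</name>; track open tags on a stack.
    stack = []
    for closing, name in re.findall(r"<(/?)([a-z-]+)>", text):
        if name not in WRAPPING_TAGS:
            problems.append(f"unknown wrapping tag <{name}>")
        elif not closing:
            stack.append(name)
        elif stack and stack[-1] == name:
            stack.pop()
        else:
            problems.append(f"unexpected </{name}>")
    problems.extend(f"unclosed <{name}>" for name in stack)
    return problems
```

An empty list means every tag was recognized and every wrapping tag was closed. Both example lines above pass this check.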
Output formats
Pick the codec and sample rate that match your use case:
| Codec | Best for |
|---|---|
| mp3 | General use — wide compatibility, good compression. |
| wav | Lossless audio — editing, post-production. |
| pcm | Raw audio — real-time processing pipelines. |
| mulaw | Telephony (G.711 μ-law). |
| alaw | Telephony (G.711 A-law). |
Defaults are MP3 at 24 kHz / 128 kbps. For studio-grade audio, use WAV, or MP3 at 192 kbps, with a sample rate of 44.1 kHz or 48 kHz. For telephony, use mulaw or alaw at 8 kHz.
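The recommendations above can be captured as named presets with basic range checks. This is only a sketch: the `AudioFormat` class and its field names are hypothetical, since this README does not state the actual input parameter names for codec, sample rate, or bit rate. It checks only the ranges given here (8 kHz to 48 kHz, 32 to 192 kbps for MP3), not the exact values the model accepts.

```python
from dataclasses import dataclass
from typing import Optional

CODECS = {"mp3", "wav", "pcm", "mulaw", "alaw"}


@dataclass(frozen=True)
class AudioFormat:
    """Hypothetical container for output settings (field names are assumed)."""

    codec: str
    sample_rate_hz: int
    bit_rate_kbps: Optional[int] = None  # described for MP3 only

    def __post_init__(self) -> None:
        if self.codec not in CODECS:
            raise ValueError(f"unknown codec: {self.codec}")
        if not 8_000 <= self.sample_rate_hz <= 48_000:
            raise ValueError("sample rate must be between 8 kHz and 48 kHz")
        if self.bit_rate_kbps is not None:
            if self.codec != "mp3":
                raise ValueError("bit rate applies to MP3 only")
            if not 32 <= self.bit_rate_kbps <= 192:
                raise ValueError("MP3 bit rate must be between 32 and 192 kbps")


# Presets mirroring the recommendations in this section.
DEFAULT = AudioFormat("mp3", 24_000, 128)
STUDIO = AudioFormat("mp3", 48_000, 192)
TELEPHONY = AudioFormat("mulaw", 8_000)
```

Map the fields onto whatever input names the model schema actually uses before passing them to a request.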
Supported languages
Arabic (Egypt, Saudi Arabia, UAE), Bengali, Chinese (Simplified), English, French, German, Hindi, Indonesian, Italian, Japanese, Korean, Portuguese (Brazil, Portugal), Russian, Spanish (Mexico, Spain), Turkish, Vietnamese.
The model auto-detects the language by default. Set language to a specific code to force a language, or to get more consistent results on ambiguous input text.
Pricing
You are charged per character of input text, at xAI’s rate of $4.20 per 1,000,000 characters, passed through 1:1.
- A 100-character message costs $0.00042.
- A 1,000-character paragraph costs $0.0042.
- A 15,000-character maximum-length request costs $0.063.
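The same arithmetic can be scripted to estimate a job before running it. This is a small sketch using `Decimal` to avoid floating-point rounding. `estimate_cost` is an illustrative helper, and it assumes every character in the string is billed, including any speech tags, which this README does not specify.

```python
from decimal import Decimal

PRICE_PER_MILLION_CHARS = Decimal("4.20")  # USD, from the rate above


def estimate_cost(text_or_length) -> Decimal:
    """Estimate the USD cost for a string, or for a character count."""
    if isinstance(text_or_length, int):
        n_chars = text_or_length
    else:
        n_chars = len(text_or_length)
    return PRICE_PER_MILLION_CHARS * n_chars / 1_000_000


print(estimate_cost(100))     # 0.00042
print(estimate_cost(15_000))  # 0.063
```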