Grok Speech-to-Text is xAI’s audio transcription model. Send it an audio file, get back accurate text — with optional word-level timestamps, speaker diarization, and per-channel transcripts.
Highlights
- 25 languages, with automatic language detection.
- Word-level timestamps — get the exact start and end time of every spoken word.
- Speaker diarization — identify who said what in conversations and interviews.
- Multichannel transcription — transcribe each audio channel separately (e.g. left and right channels of a phone call).
- Inverse text normalization — write numbers, currencies, and units in their written form (“one hundred dollars” becomes “$100”).
- Big files — up to 500 MB per request.
- Many formats — WAV, MP3, M4A, FLAC, OGG, WebM, AAC, MP4, Opus.
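A local pre-flight check can catch oversized or unsupported files before uploading. The sketch below is illustrative only: the helper name, the extension set, and the reading of "500 MB" as 500,000,000 bytes are assumptions, not part of the model's documented behavior.

```python
from pathlib import Path

# Size limit from the highlights above. Treating 1 MB as 1,000,000 bytes is an
# assumption; the source does not say which definition applies.
MAX_BYTES = 500 * 1_000_000

# Hypothetical mapping of the listed formats to common file extensions.
SUPPORTED_EXTENSIONS = {
    ".wav", ".mp3", ".m4a", ".flac", ".ogg", ".webm", ".aac", ".mp4", ".opus",
}


def preflight_problems(filename: str, size_bytes: int) -> list[str]:
    """Return human-readable problems; an empty list means the file looks OK."""
    problems = []
    suffix = Path(filename).suffix.lower()
    if suffix not in SUPPORTED_EXTENSIONS:
        problems.append(f"unrecognized extension {suffix!r}")
    if size_bytes > MAX_BYTES:
        problems.append(f"file is {size_bytes} bytes, over the {MAX_BYTES}-byte limit")
    return problems
```

An extension check is only a heuristic, since a file's extension does not guarantee its actual encoding.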
Quick start
import replicate

output = replicate.run(
    "xai/grok-speech-to-text",
    input={
        "audio": open("podcast.mp3", "rb"),
        "timestamps": True,
    },
)
print(output["text"])
for word in output["words"]:
    print(f"{word['start']:.2f}\t{word['text']}")
Inputs
| Input | Description |
|---|---|
| audio | Audio file to transcribe. Max 500 MB. |
| language | Language code (e.g. en, fr, de). auto (default) detects the language automatically. |
| timestamps | Include word-level start and end timestamps in the output. |
| diarize | Tag each word with a speaker index. |
| multichannel | Transcribe each audio channel independently. The audio must have at least 2 channels. |
| format_text | Convert spoken numbers, currencies, and units to written form. Requires language to be set to a specific code (not auto). |
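The one cross-field rule in the table (format_text needs an explicit language) is easy to check before sending a request. The following is a minimal sketch: build_input is a hypothetical helper, and the False defaults for the boolean options are assumptions, since the table only states the default for language.

```python
def build_input(audio, language="auto", timestamps=False, diarize=False,
                multichannel=False, format_text=False):
    """Assemble the input dict, enforcing the constraint stated in the table."""
    # format_text is documented as requiring a specific language code.
    if format_text and language == "auto":
        raise ValueError("format_text requires a specific language code, not 'auto'")
    return {
        "audio": audio,
        "language": language,
        "timestamps": timestamps,      # assumed default: False
        "diarize": diarize,            # assumed default: False
        "multichannel": multichannel,  # assumed default: False
        "format_text": format_text,    # assumed default: False
    }
```

The returned dict can be passed as the input argument to replicate.run, in the same way as the quick start.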
Output
{
  "text": "The full transcript as a single string.",
  "duration": 12.34,
  "language": "English",
  "words": [
    {"text": "The", "start": 0.00, "end": 0.18, "speaker": 0, "channel": null},
    {"text": "full", "start": 0.22, "end": 0.48, "speaker": 0, "channel": null}
  ]
}
The words field is null unless timestamps is true. Each word object carries a speaker index when diarize is on and a channel index when multichannel is on.
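With timestamps and diarize both on, the word list can be folded into speaker turns. This is a rough sketch, assuming the output shape shown above: speaker_turns is a hypothetical helper, the sample data is hand-written rather than real model output, and joining words with a single space is an assumption that may not suit every language.

```python
def speaker_turns(words):
    """Merge consecutive words from the same speaker into turn dicts."""
    turns = []
    for word in words or []:  # words is null (None) when timestamps is off
        speaker = word.get("speaker")
        if turns and turns[-1]["speaker"] == speaker:
            # Same speaker as the previous word: extend the current turn.
            turns[-1]["end"] = word["end"]
            turns[-1]["text"] += " " + word["text"]
        else:
            # Speaker changed (or first word): start a new turn.
            turns.append({
                "speaker": speaker,
                "start": word["start"],
                "end": word["end"],
                "text": word["text"],
            })
    return turns


# Hand-written sample in the documented shape, not real model output.
sample = {
    "words": [
        {"text": "Hi", "start": 0.00, "end": 0.20, "speaker": 0, "channel": None},
        {"text": "there", "start": 0.25, "end": 0.50, "speaker": 0, "channel": None},
        {"text": "Hello", "start": 0.80, "end": 1.10, "speaker": 1, "channel": None},
    ]
}

for turn in speaker_turns(sample["words"]):
    print(f"[{turn['start']:.2f}-{turn['end']:.2f}] speaker {turn['speaker']}: {turn['text']}")
```

The same loop could group by the channel field instead when multichannel is on.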
Supported languages
Arabic, Czech, Danish, Dutch, English, Filipino, French, German, Hindi, Indonesian, Italian, Japanese, Korean, Macedonian, Malay, Persian, Polish, Portuguese, Romanian, Russian, Spanish, Swedish, Thai, Turkish, Vietnamese.
The model auto-detects the language by default. Set language to a specific code when you want to use format_text, or to nudge the model toward the right language for unusual or noisy audio.
Pricing
Charged per minute of input audio, rounded up to the nearest full minute.
- A 10-second clip costs 1 minute.
- A 50-second clip costs 1 minute.
- A 1-minute-1-second clip costs 2 minutes.
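The rounding rule amounts to a ceiling division by 60. A small sketch that reproduces the three examples above; billed_minutes is a hypothetical helper name, and how a zero-length clip is billed is not stated in the source.

```python
import math


def billed_minutes(duration_seconds: float) -> int:
    """Round an audio duration up to the next full minute."""
    return math.ceil(duration_seconds / 60)


for seconds in (10, 50, 61):
    print(seconds, "seconds ->", billed_minutes(seconds), "billed minute(s)")
```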