xai/grok-speech-to-text

Transcribe audio to text with xAI's Grok. Handles 25 languages, word-level timestamps, speaker diarization, multichannel audio, and files up to 500 MB.

Grok Speech-to-Text is xAI’s audio transcription model. Send it an audio file and get back a transcript, with optional word-level timestamps, speaker diarization, and per-channel transcripts.

Highlights

  • 25 languages, with automatic language detection.
  • Word-level timestamps — get the exact start and end time of every spoken word.
  • Speaker diarization — identify who said what in conversations and interviews.
  • Multichannel transcription — transcribe each audio channel separately (e.g. left and right channels of a phone call).
  • Inverse text normalization — write numbers, currencies, and units in their written form (“one hundred dollars” becomes “$100”).
  • Big files — up to 500 MB per request.
  • Many formats — WAV, MP3, M4A, FLAC, OGG, WebM, AAC, MP4, Opus.

Quick start

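import replicate

# Open the file in a context manager so the handle is closed after the request.
with open("podcast.mp3", "rb") as audio_file:
    output = replicate.run(
        "xai/grok-speech-to-text",
        input={
            "audio": audio_file,
            "timestamps": True,  # required for output["words"] to be populated
        },
    )

print(output["text"])
# Print each word's start time (in seconds) followed by the word itself.
for word in output["words"]:
    print(f"{word['start']:.2f}\t{word['text']}")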

Inputs

Input         Description
audio         Audio file to transcribe. Max 500 MB.
language      Language code (e.g. en, fr, de). auto (default) detects the language automatically.
timestamps    Include word-level start and end timestamps in the output.
diarize       Tag each word with a speaker index.
multichannel  Transcribe each audio channel independently. The audio must have at least 2 channels.
format_text   Convert spoken numbers, currencies, and units to written form. Requires language to be set to a specific code (not auto).
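
The two constraints listed above (format_text needs a specific language, multichannel needs at least 2 channels) can be checked locally before a request is sent. The following is a minimal sketch, not part of the model's API: build_input is a hypothetical helper, the channels argument is a caller-supplied channel count rather than a model input, and the False defaults for the boolean inputs are assumptions.

```python
def build_input(audio, language="auto", timestamps=False, diarize=False,
                multichannel=False, format_text=False, channels=None):
    """Assemble the input dict, rejecting combinations the inputs table rules out.

    Hypothetical helper. `channels` is the known channel count of the audio
    (None if unknown); it is used only for validation and is not sent.
    The False defaults for the boolean flags are assumptions.
    """
    if format_text and language == "auto":
        raise ValueError("format_text requires a specific language code, not 'auto'")
    if multichannel and channels is not None and channels < 2:
        raise ValueError("multichannel requires audio with at least 2 channels")
    return {
        "audio": audio,
        "language": language,
        "timestamps": timestamps,
        "diarize": diarize,
        "multichannel": multichannel,
        "format_text": format_text,
    }
```

The returned dict is meant to be passed as the input argument of replicate.run, as in the quick start.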

Output

{
  "text": "The full transcript as a single string.",
  "duration": 12.34,
  "language": "English",
  "words": [
    {"text": "The", "start": 0.00, "end": 0.18, "speaker": 0, "channel": null},
    {"text": "full", "start": 0.22, "end": 0.48, "speaker": 0, "channel": null}
  ]
}

words is null unless timestamps is true. Each word object carries a speaker index when diarize is on and a channel index when multichannel is on. In the example above, multichannel is off, so channel is null.
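
With diarize on, the per-word speaker index can be folded into speaker turns. The sketch below is one possible way to do that, assuming the word objects have the shape shown above. speaker_turns is a hypothetical helper, the sample words are illustrative rather than real model output, and joining with spaces assumes a space-delimited language.

```python
from itertools import groupby


def speaker_turns(words):
    """Merge consecutive words from the same speaker into (speaker, text) turns."""
    turns = []
    # groupby only groups adjacent items, which is what a "turn" needs.
    for speaker, group in groupby(words, key=lambda w: w["speaker"]):
        turns.append((speaker, " ".join(w["text"] for w in group)))
    return turns


# Illustrative sample in the documented shape (not real model output).
words = [
    {"text": "Hello", "start": 0.00, "end": 0.30, "speaker": 0, "channel": None},
    {"text": "there.", "start": 0.35, "end": 0.60, "speaker": 0, "channel": None},
    {"text": "Hi!", "start": 0.90, "end": 1.10, "speaker": 1, "channel": None},
]

for speaker, text in speaker_turns(words):
    print(f"Speaker {speaker}: {text}")
```

The same pattern should work for multichannel output by grouping on the channel key instead of speaker.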

Supported languages

Arabic, Czech, Danish, Dutch, English, Filipino, French, German, Hindi, Indonesian, Italian, Japanese, Korean, Macedonian, Malay, Persian, Polish, Portuguese, Romanian, Russian, Spanish, Swedish, Thai, Turkish, Vietnamese.

The model auto-detects the language by default. Set language to a specific code when you want to use format_text, or to nudge the model toward the right language for unusual or noisy audio.

Pricing

Charged per minute of input audio, rounded up to the next full minute.

  • A 10-second clip costs 1 minute.
  • A 50-second clip costs 1 minute.
  • A 1-minute-1-second clip costs 2 minutes.
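
The rounding rule can be written as a ceiling division. This is a small sketch of the arithmetic only: billed_minutes is a hypothetical helper, and because the per-minute rate is not stated here, it stops at the number of billed minutes.

```python
import math


def billed_minutes(duration_seconds):
    """Minutes charged for a clip: its duration rounded up to a full minute."""
    return math.ceil(duration_seconds / 60)


# The three examples from the pricing list: 10 s, 50 s, and 1 min 1 s.
for seconds in (10, 50, 61):
    print(f"{seconds} s -> {billed_minutes(seconds)} min")
```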