Grok Speech-to-Text | xAI on Replicate

Grok Speech-to-Text is xAI’s audio transcription model. Send it an audio file, get back accurate text — with optional word-level timestamps, speaker diarization, and per-channel transcripts.

Highlights

25 languages, with automatic language detection.
Word-level timestamps — get the exact start and end time of every spoken word.
Speaker diarization — identify who said what in conversations and interviews.
Multichannel transcription — transcribe each audio channel separately (e.g. left and right channels of a phone call).
Inverse text normalization — write numbers, currencies, and units in their written form (“one hundred dollars” becomes “$100”).
Big files — up to 500 MB per request.
Many formats — WAV, MP3, M4A, FLAC, OGG, WebM, AAC, MP4, Opus.

Quick start

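import replicate

# Open the file in binary mode; the context manager closes it after the call.
with open("podcast.mp3", "rb") as audio_file:
    output = replicate.run(
        "xai/grok-speech-to-text",
        input={
            "audio": audio_file,
            "timestamps": True,  # needed for output["words"] to be populated
        },
    )

print(output["text"])
# Print each word next to its start time.
for word in output["words"]:
    print(f"{word['start']:.2f}\t{word['text']}")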
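
Before uploading, it can help to check a file against the size limit and format list given under Highlights. The following is a minimal sketch, not part of the model's API: `preflight_problems`, `MAX_BYTES`, and `SUPPORTED_EXTENSIONS` are hypothetical names, the extension set is inferred from the format list, and reading "500 MB" as binary megabytes is an assumption.

```python
import os

# Assumption: "500 MB" is read as 500 * 1024 * 1024 bytes. The source does not
# say whether the limit is counted in decimal or binary megabytes.
MAX_BYTES = 500 * 1024 * 1024

# Hypothetical mapping from the documented format list to file extensions.
SUPPORTED_EXTENSIONS = {
    ".wav", ".mp3", ".m4a", ".flac", ".ogg", ".webm", ".aac", ".mp4", ".opus",
}


def preflight_problems(path, size_bytes=None):
    """Return a list of problems found; an empty list means the file looks fine."""
    problems = []
    ext = os.path.splitext(path)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension: {ext or '(none)'}")
    if size_bytes is None:
        # Fall back to the size on disk when the caller does not supply one.
        size_bytes = os.path.getsize(path)
    if size_bytes > MAX_BYTES:
        problems.append(f"file is {size_bytes} bytes, over the {MAX_BYTES}-byte limit")
    return problems


# Passing size_bytes explicitly keeps this example independent of the file system.
print(preflight_problems("podcast.mp3", size_bytes=12_000_000))
```

An extension check is only a heuristic: a file can carry a supported extension and still contain audio the model rejects.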

Inputs

Input	Description
`audio`	Audio file to transcribe. Max 500 MB.
`language`	Language code (e.g. `en`, `fr`, `de`). `auto` (default) detects the language automatically.
`timestamps`	Include word-level start and end timestamps in the output.
`diarize`	Tag each word with a speaker index.
`multichannel`	Transcribe each audio channel independently. The audio must have at least 2 channels.
`format_text`	Convert spoken numbers, currencies, and units to written form. Requires `language` to be set to a specific code (not `auto`).
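
The table above implies one constraint that can be checked before a request is sent: `format_text` needs an explicit `language`. A small sketch of a helper that assembles the `input` dictionary and enforces that rule follows. `build_input` is a hypothetical name, not part of the Replicate client, and the audio URL is a placeholder.

```python
# Option names taken from the Inputs table.
ALLOWED_FLAGS = {"timestamps", "diarize", "multichannel", "format_text"}


def build_input(audio, language="auto", **flags):
    """Assemble the input dict, rejecting combinations the table rules out."""
    unknown = set(flags) - ALLOWED_FLAGS
    if unknown:
        raise ValueError(f"unknown input(s): {sorted(unknown)}")
    if flags.get("format_text") and language == "auto":
        # Documented rule: format_text requires a specific language code.
        raise ValueError("format_text requires a specific language code, not 'auto'")
    return {"audio": audio, "language": language, **flags}


# Placeholder URL for illustration only.
model_input = build_input(
    "https://example.com/call.wav",
    language="en",
    timestamps=True,
    format_text=True,
)
print(model_input)
```

The resulting dictionary can be passed as `input=` to `replicate.run`. The two-channel requirement for `multichannel` is not checked here, since verifying it would mean decoding the audio file.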

Output

{
  "text": "The full transcript as a single string.",
  "duration": 12.34,
  "language": "English",
  "words": [
    {"text": "The", "start": 0.00, "end": 0.18, "speaker": 0, "channel": null},
    {"text": "full", "start": 0.22, "end": 0.48, "speaker": 0, "channel": null}
  ]
}

`words` is `null` unless `timestamps` is true. Each word object includes a `speaker` index when `diarize` is on, and a `channel` index when `multichannel` is on.
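
With `timestamps` and `diarize` both on, the word list can be folded into speaker turns. The sketch below works on a dictionary shaped like the example output. `speaker_turns` is a hypothetical helper, the third word in the sample data is invented for illustration, and joining words with a space is an assumption that will not suit languages written without spaces.

```python
def speaker_turns(words):
    """Merge consecutive words from the same speaker into turns."""
    turns = []
    for word in words or []:  # words is null (None in Python) when timestamps is off
        speaker = word.get("speaker")
        if turns and turns[-1]["speaker"] == speaker:
            # Same speaker as the previous word: extend the current turn.
            turns[-1]["end"] = word["end"]
            turns[-1]["text"] += " " + word["text"]
        else:
            # Speaker changed (or first word): start a new turn.
            turns.append({
                "speaker": speaker,
                "start": word["start"],
                "end": word["end"],
                "text": word["text"],
            })
    return turns


# Sample shaped like the documented output; the last word is made up.
sample_words = [
    {"text": "The", "start": 0.00, "end": 0.18, "speaker": 0, "channel": None},
    {"text": "full", "start": 0.22, "end": 0.48, "speaker": 0, "channel": None},
    {"text": "Thanks", "start": 0.60, "end": 0.95, "speaker": 1, "channel": None},
]

for turn in speaker_turns(sample_words):
    print(f"[{turn['start']:.2f}-{turn['end']:.2f}] speaker {turn['speaker']}: {turn['text']}")
```

The same loop groups by channel instead if `"speaker"` is swapped for `"channel"` on multichannel output.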

Supported languages

Arabic, Czech, Danish, Dutch, English, Filipino, French, German, Hindi, Indonesian, Italian, Japanese, Korean, Macedonian, Malay, Persian, Polish, Portuguese, Romanian, Russian, Spanish, Swedish, Thai, Turkish, Vietnamese.

The model auto-detects the language by default. Set `language` to a specific code when you want to use `format_text`, or to nudge the model toward the right language for unusual or noisy audio.

Pricing

Billing is per minute of input audio, rounded up to the next full minute.

A 10-second clip is billed as 1 minute.
A 50-second clip is billed as 1 minute.
A 1-minute-1-second clip is billed as 2 minutes.
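
The rounding rule can be written as a one-line calculation. This is a sketch for estimating usage, not the billing implementation. `billed_minutes` is a hypothetical name, and rejecting non-positive durations is an assumption, since the source does not describe how empty audio is handled.

```python
import math


def billed_minutes(duration_seconds):
    """Round an audio duration in seconds up to whole billed minutes."""
    if duration_seconds <= 0:
        # Assumption: the source does not say how zero-length audio is billed.
        raise ValueError("duration must be positive")
    return math.ceil(duration_seconds / 60)


# The three clips from the pricing examples: 10 s, 50 s, and 1 min 1 s.
for seconds in (10, 50, 61):
    print(seconds, "seconds ->", billed_minutes(seconds), "minute(s)")
```

If the `duration` field in the output is expressed in seconds, which the example value suggests but the source does not state, it could be passed to this function to estimate the charge for a completed request.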
