Official

openai / gpt-4o-transcribe

A speech-to-text model that uses GPT-4o to transcribe audio

  • Public
  • 475 runs
  • Priced per token
  • Commercial use
  • License

Input

*file

The audio file to transcribe. Supported formats: mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm

string

The language of the input audio. Supplying the input language in ISO-639-1 (e.g. en) format will improve accuracy and latency.

string

An optional text to guide the model's style or continue a previous audio segment. The prompt should match the audio language.

number
(minimum: 0, maximum: 1)

Sampling temperature between 0 and 1

Default: 0
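
The inputs above can be exercised from Python roughly as follows. This is a minimal sketch, not the official usage: it assumes the third-party `replicate` client is installed and `REPLICATE_API_TOKEN` is set, and the input key names (`audio_file`, `language`, `prompt`, `temperature`) are assumptions, since this page does not show them. The helper names `check_inputs` and `transcribe` are hypothetical.

```python
from pathlib import Path

MODEL = "openai/gpt-4o-transcribe"
# Formats listed in the input description on this page.
SUPPORTED_FORMATS = {"mp3", "mp4", "mpeg", "mpga", "m4a", "ogg", "wav", "webm"}


def check_inputs(path, language=None, temperature=0.0):
    """Validate arguments against the constraints listed on this page."""
    suffix = Path(path).suffix.lstrip(".").lower()
    if suffix not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported audio format: {suffix!r}")
    if language is not None:
        # ISO-639-1 codes are two letters, e.g. "en".
        if not (len(language) == 2 and language.isascii() and language.isalpha()):
            raise ValueError("language should be a two-letter ISO-639-1 code")
    if not 0 <= temperature <= 1:
        raise ValueError("temperature must be between 0 and 1")


def transcribe(path, language=None, prompt=None, temperature=0.0):
    """Send one audio file to the model and return the client's raw output."""
    check_inputs(path, language, temperature)
    import replicate  # imported lazily so the validation works without it

    with open(path, "rb") as audio:
        # ASSUMPTION: these key names are not confirmed by this page.
        inputs = {"audio_file": audio, "temperature": temperature}
        if language:
            inputs["language"] = language
        if prompt:
            inputs["prompt"] = prompt
        return replicate.run(MODEL, input=inputs)
```

A call might look like `transcribe("talk.mp3", language="en")`. Validating locally first avoids paying for a request that the model would reject.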

Output

So we just added GPT-4o transcribe to Replicate and thought you'd want to know. It's basically a speech-to-text model that uses GPT-4o to turn your audio into text. The cool thing is that it's noticeably better than the Whisper models we've been using, fewer errors, better at recognizing different languages, and just more accurate overall. If you've ever been frustrated with transcripts that mess up technical terms or struggle with different accents, you'll probably appreciate this upgrade. It just works better. Some quick tech specs if you're curious. It has a 16,000 token context window, which means it can handle longer audio clips in one go. And it can output up to 2,000 tokens, so you'll get nice complete transcripts. The model's knowledge is current up to June 2024, so it's pretty up-to-date with language and terminology.
Input tokens: 910
Output tokens: 170
Tokens per second: 66.02

Pricing

Official model
Pricing for official models works differently from other models. Instead of being billed by time, you’re billed by input and output, making pricing more predictable.

This model is priced by how many input tokens are sent and how many output tokens are generated.

Type     Per unit             Per $1
Input    $6.00 / 1M tokens    160K tokens / $1
Output   $10.00 / 1M tokens   100K tokens / $1

For example, for $10 you can run around 1,410 predictions where the input is a sentence or two (15 tokens) and the output is a few paragraphs (700 tokens).
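
The per-token arithmetic can be sketched as follows. This is an illustrative calculation using the rates listed on this page, not Replicate's billing code, and the function names are hypothetical. It counts both input and output cost, so for the 15-token, 700-token example it yields about 1,410 predictions (output cost alone would give about 1,429).

```python
from fractions import Fraction

# Rates from the pricing table on this page, in USD per 1M tokens.
INPUT_USD_PER_MILLION = Fraction(6)
OUTPUT_USD_PER_MILLION = Fraction(10)


def prediction_cost(input_tokens, output_tokens):
    """Exact cost in USD of one prediction, as a Fraction."""
    total = (input_tokens * INPUT_USD_PER_MILLION
             + output_tokens * OUTPUT_USD_PER_MILLION)
    return total / 1_000_000


def predictions_for_budget(budget_usd, input_tokens, output_tokens):
    """Number of whole predictions that fit within the budget."""
    return int(Fraction(budget_usd) // prediction_cost(input_tokens, output_tokens))


# The example from the text: 15 input tokens, 700 output tokens, $10 budget.
print(predictions_for_budget(10, 15, 700))
# The sample run on this page: 910 input tokens, 170 output tokens.
print(float(prediction_cost(910, 170)))
```

Using `Fraction` keeps the division exact, which avoids floating-point rounding near the boundary between two whole prediction counts.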

Check out our docs for more information about how per-token pricing works on Replicate.

Readme

GPT-4o Transcribe is a speech-to-text model that uses GPT-4o to transcribe audio. Compared to the original Whisper models, it offers a lower word error rate, better language recognition, and higher accuracy. Use it for more accurate transcripts.

  • 16,000-token context window
  • 2,000 max output tokens
  • Jun 01, 2024 knowledge cutoff