Official

openai / gpt-4o-transcribe

A speech-to-text model that uses GPT-4o to transcribe audio (Updated 3 weeks, 6 days ago)

  • Public
  • 1K runs
  • Priced by multiple properties
  • Commercial use
  • License

Input

*file

The audio file to transcribe. Supported formats: mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm.

string

The language of the input audio. Supplying the input language in ISO-639-1 (e.g. en) format will improve accuracy and latency.

string

Optional text to guide the model's style or to continue a previous audio segment. The prompt should be in the same language as the audio.

number
(minimum: 0, maximum: 1)

Sampling temperature between 0 and 1

Default: 0
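
The input fields above can be assembled and checked locally before a request is sent. The following is a minimal sketch, not Replicate's official client code: the key names `audio_file`, `language`, `prompt`, and `temperature` and the helper `build_input` are assumptions made for illustration, since this page does not spell out the field names. The format list and the 0 to 1 temperature range come from the field descriptions above.

```python
from pathlib import Path
from typing import Optional

# Formats listed in the file field's description.
SUPPORTED_FORMATS = {"mp3", "mp4", "mpeg", "mpga", "m4a", "ogg", "wav", "webm"}


def build_input(audio_path: str,
                language: Optional[str] = None,
                prompt: Optional[str] = None,
                temperature: float = 0.0) -> dict:
    """Validate the arguments and return an input dict (key names are assumed)."""
    ext = Path(audio_path).suffix.lower().lstrip(".")
    if ext not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported audio format: {ext!r}")
    if not 0 <= temperature <= 1:
        raise ValueError("temperature must be between 0 and 1")
    # ISO-639-1 codes are two ASCII letters, e.g. "en".
    if language is not None:
        if not (len(language) == 2 and language.isascii() and language.isalpha()):
            raise ValueError("language should be a two-letter ISO-639-1 code")

    payload = {"audio_file": audio_path, "temperature": temperature}
    # Optional fields are only included when the caller supplies them.
    if language is not None:
        payload["language"] = language.lower()
    if prompt is not None:
        payload["prompt"] = prompt
    return payload


print(build_input("meeting.mp3", language="en"))
```

A dict like this could then be passed as the `input` argument to `replicate.run("openai/gpt-4o-transcribe", input=...)` in the Replicate Python client, with the audio supplied as an open file or a URL. Check the model's API schema for the actual field names before relying on the ones assumed here.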

Output

So we just added GPT-4o transcribe to Replicate and thought you'd want to know. It's basically a speech-to-text model that uses GPT-4o to turn your audio into text. The cool thing is that it's noticeably better than the Whisper models we've been using, fewer errors, better at recognizing different languages, and just more accurate overall. If you've ever been frustrated with transcripts that mess up technical terms or struggle with different accents, you'll probably appreciate this upgrade. It just works better. Some quick tech specs if you're curious. It has a 16,000 token context window, which means it can handle longer audio clips in one go. And it can output up to 2,000 tokens, so you'll get nice complete transcripts. The model's knowledge is current up to June 2024, so it's pretty up-to-date with language and terminology.
Input tokens: 910
Output tokens: 170
Tokens per second: 66.02

Pricing

Model pricing for openai/gpt-4o-transcribe. Looking for volume pricing? Get in touch.

  • $0.01 per thousand output tokens, or 100,000 tokens for $1
  • $6 per million input tokens, or around 166,666 tokens for $1
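
To make the two rates concrete, here is a small sketch that estimates the cost of a single prediction from its token counts. The function name `estimate_cost` is hypothetical; the rates are the ones listed above, and the example uses the token counts from the sample run on this page (910 input tokens, 170 output tokens). It ignores any volume pricing.

```python
# Rates from the pricing section, expressed in dollars per token.
INPUT_RATE = 6 / 1_000_000   # $6 per million input tokens
OUTPUT_RATE = 0.01 / 1_000   # $0.01 per thousand output tokens


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in dollars for one prediction."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE


# Token counts from the example run shown on this page.
cost = estimate_cost(910, 170)
print(f"${cost:.5f}")
```

For the sample run this works out to roughly $0.00546 for input plus $0.0017 for output, or about $0.0072 in total.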

Official models are always on, maintained, and have predictable pricing. Learn more.

Check out our docs for more information about how pricing works on Replicate.

Readme

GPT-4o Transcribe is a speech-to-text model that uses GPT-4o to transcribe audio. Compared to the original Whisper models, it achieves a lower word error rate, better language recognition, and higher accuracy. Use it when you need more accurate transcripts.
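
For readers who want to compare transcripts themselves, word error rate is conventionally defined as the number of word substitutions, deletions, and insertions needed to turn the transcript into a reference text, divided by the number of reference words. The sketch below implements that standard definition with a word-level edit distance. It is not the evaluation procedure used by OpenAI or Replicate, and the lowercase-and-split normalization is a simplifying assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev is the previous row of the distance table: prev[j] is the distance
    # between the reference words processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the model transcribes audio",
                      "the model transcribed audio"))
```

Here one of the four reference words is wrong, so the rate is 0.25. Real evaluations usually also strip punctuation and normalize numbers before scoring.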

  • 16,000-token context window
  • 2,000 max output tokens
  • Jun 01, 2024 knowledge cutoff