thomasmol / whisper-diarization

⚡️ Fast audio transcription | whisper large-v3 | speaker diarization | word & sentence level timestamps | prompt | hotwords


Run time and cost

This model costs approximately $0.090 to run on Replicate (about 11 runs per $1), but this varies depending on your inputs. It is also open source, and you can run it on your own computer with Docker.

This model runs on Nvidia A40 GPU hardware. Predictions typically complete within 3 minutes. The predict time for this model varies significantly based on the inputs.

Readme

Transcribe any audio file with speaker diarization

Uses Whisper Large V3 + Pyannote.audio 3.3

Easily create transcripts with speaker labels and timestamps (diarization) using this model. It uses faster-whisper 1.0.3 and pyannote.audio 3.3.1 under the hood.
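For a sense of how those pieces fit together if you run it yourself, the core of the pipeline is: transcribe with faster-whisper, diarize with pyannote.audio, then label each transcript segment with the speaker whose turns overlap it most. A simplified Python sketch under those assumptions, not the repo's exact code; the checkpoint name pyannote/speaker-diarization-3.1, the Hugging Face token, and the file path are placeholders:

```python
from faster_whisper import WhisperModel
from pyannote.audio import Pipeline

AUDIO = "audio.mp3"  # placeholder path

# 1. Transcribe with faster-whisper (Whisper large-v3, word timestamps on).
model = WhisperModel("large-v3", device="cuda", compute_type="float16")
segments, info = model.transcribe(AUDIO, word_timestamps=True)
segments = list(segments)  # the generator is lazy; materialize it

# 2. Diarize with pyannote.audio (gated model; needs a Hugging Face token).
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token="hf_..."  # placeholder token
)
diarization = pipeline(AUDIO)

# 3. Label each transcript segment with the speaker whose turns overlap it most.
def best_speaker(start: float, end: float) -> str:
    overlap: dict[str, float] = {}
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        shared = min(end, turn.end) - max(start, turn.start)
        if shared > 0:
            overlap[speaker] = overlap.get(speaker, 0.0) + shared
    return max(overlap, key=overlap.get) if overlap else "UNKNOWN"

for seg in segments:
    print(f"[{seg.start:7.2f} - {seg.end:7.2f}] {best_speaker(seg.start, seg.end)}: {seg.text.strip()}")
```

The actual model adds more on top of this: VAD filtering, word-level alignment, segment grouping, and offset handling, as described in the inputs below.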

Last update: 8 July 2024. Updated to the latest faster-whisper version with improved VAD: say goodbye to hallucinations! Added support for ‘hotwords’, which work like initial_prompt in Whisper but are added to each window.

Usage

Input

  • file_string: str: Provide the audio as a Base64-encoded string, or
  • file_url: str: provide a direct URL to the audio file, or
  • file: Path: provide the audio file itself. Exactly one of these three is required.
  • group_segments: bool: Group segments of the same speaker shorter than 2 seconds apart. Default is True.
  • num_speakers: int: Number of speakers. Leave empty to autodetect. Must be between 1 and 50.
  • language: str: Language of the spoken words as a language code like ‘en’. Leave empty to auto detect language.
  • prompt: str: Vocabulary: provide names, acronyms, and foreign words in a list. Also used as the ‘hotwords’ parameter of faster-whisper. Use punctuation for best accuracy.
  • offset_seconds: int: Offset in seconds, used for chunked inputs. Default is 0.
  • transcript_output_format: str: Specify the format of the transcript output: individual words with timestamps, full text of segments, or a combination of both. Options are words_only, segments_only, and both. Default is both. See the example call below.
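
A minimal sketch of a call with these inputs via the Replicate Python client; the URL is a placeholder, and in production you may want to pin a specific model version:

```python
import replicate

# Provide exactly one of file, file_url, or file_string.
# For file_string, Base64-encode the audio first, e.g.:
#   import base64
#   b64 = base64.b64encode(open("audio.mp3", "rb").read()).decode()
output = replicate.run(
    "thomasmol/whisper-diarization",
    input={
        "file_url": "https://example.com/interview.mp3",  # placeholder URL
        "num_speakers": 2,         # omit to autodetect (1-50)
        "language": "en",          # omit to autodetect
        "prompt": "Audiogest, Spectropic.",  # vocabulary / hotwords, punctuated
        "group_segments": True,
        "transcript_output_format": "both",
    },
)
```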

Output

  • segments: List[Dict]: List of segments with speaker, start and end time. Includes an avg_logprob for each segment and a probability for each word-level entry.
  • num_speakers: int: Number of speakers (detected, unless specified in input).
  • language: str: Language of the spoken words as a language code like ‘en’ (detected, unless specified in input).
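
The output is a plain dictionary, so a speaker-labelled transcript falls out directly. A sketch, assuming each segment also carries a text field alongside speaker, start, and end (continuing from the output value above):

```python
# Print a simple speaker-labelled transcript from the model output.
for seg in output["segments"]:
    print(f'[{seg["start"]} - {seg["end"]}] {seg["speaker"]}: {seg["text"]}')

print("Speakers:", output["num_speakers"])
print("Language:", output["language"])
```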

Made possible by

faster-whisper and pyannote.audio.

Speed

On an A40 GPU, transcribing and diarizing a 25-minute MP3 of two people speaking English takes about 2 minutes.

About

I am a maker, building 🎙️ Audiogest, a web app that uses this model: upload audio or video files, generate transcripts and summaries, and edit and export transcripts. I am also building Spectropic AI, a simple API wrapper around this model. Contact me if you’d like a demo or want to know more: thomas@spectropic.ai or X/Twitter: x.com/thomas_mol