Transcribe any audio file with speaker diarization

Uses Whisper Large V3 Turbo + Pyannote.audio 3.3

This model creates transcripts with speaker labels (diarization) and timestamps. It uses faster-whisper 1.1.1 and pyannote 3.3.1 under the hood.

Last update: 19 February 2025. Now uses the Whisper Large V3 Turbo model, with improved diarization and transcript merging. Segments are also shorter.

Usage

Input

file_string: str: Either provide a Base64-encoded audio file.
file_url: str: Or provide a direct audio file URL.
file: Path: Or provide an audio file.
num_speakers: int: Number of speakers. Leave empty to autodetect. Must be between 1 and 50.
language: str: Language of the spoken words as a language code like ‘en’. Leave empty to autodetect the language.
prompt: str: Vocabulary: provide names, acronyms, and foreign words in a list. Use punctuation for best accuracy.
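
As a rough illustration of the inputs above, the sketch below assembles an input payload in Python. The helper names (encode_audio_base64, build_input) are hypothetical and not part of this model. The rule that exactly one audio source is set is an assumption read from the "Either ... Or" wording, and the file input is left out for brevity. The range check mirrors the documented 1 to 50 limit for num_speakers.

```python
import base64
from pathlib import Path
from typing import Optional


def encode_audio_base64(path: Path) -> str:
    """Read an audio file and return its contents as a Base64 string (for file_string)."""
    return base64.b64encode(path.read_bytes()).decode("ascii")


def build_input(
    file_url: Optional[str] = None,
    file_string: Optional[str] = None,
    num_speakers: Optional[int] = None,
    language: Optional[str] = None,
    prompt: Optional[str] = None,
) -> dict:
    """Assemble an input payload, omitting optional fields that were left empty."""
    # Assumption: exactly one audio source should be provided.
    if (file_url is None) == (file_string is None):
        raise ValueError("provide exactly one of file_url or file_string")
    # Documented constraint: num_speakers must be between 1 and 50.
    if num_speakers is not None and not 1 <= num_speakers <= 50:
        raise ValueError("num_speakers must be between 1 and 50")

    payload = {
        "file_url": file_url,
        "file_string": file_string,
        "num_speakers": num_speakers,
        "language": language,
        "prompt": prompt,
    }
    # Empty fields are dropped so the model can autodetect them.
    return {key: value for key, value in payload.items() if value is not None}


# Example with a placeholder URL and an illustrative vocabulary prompt.
example_input = build_input(
    file_url="https://example.com/interview.mp3",
    num_speakers=2,
    prompt="Audiogest, Pyannote, Whisper.",
)
```

Leaving language out of the payload, as in the example, lets the model detect it.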

Output

segments: List[Dict]: List of segments, each with speaker, start time, and end time. Includes avg_logprob for each segment and probability for each word-level segment.
num_speakers: int: Number of speakers (detected, unless specified in input).
language: str: Language of the spoken words as a language code like ‘en’ (detected, unless specified in input).
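
To show how the output might be consumed, here is a minimal Python sketch that turns segments into readable transcript lines, totals speaking time per speaker, and flags segments with a low avg_logprob. The dictionary keys (speaker, start, end, text, avg_logprob), the use of seconds for times, the -1.0 threshold, and the sample data are all assumptions for illustration. Check them against a real response before relying on them.

```python
from collections import defaultdict
from typing import Dict, List


def format_timestamp(seconds: float) -> str:
    """Render a time in seconds as MM:SS (fractions are truncated)."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def format_transcript(segments: List[Dict]) -> List[str]:
    """Build one '[start - end] speaker: text' line per segment."""
    return [
        f"[{format_timestamp(s['start'])} - {format_timestamp(s['end'])}] "
        f"{s['speaker']}: {s['text'].strip()}"
        for s in segments
    ]


def speaking_time(segments: List[Dict]) -> Dict[str, float]:
    """Sum segment durations (in seconds) per speaker."""
    totals: Dict[str, float] = defaultdict(float)
    for s in segments:
        totals[s["speaker"]] += s["end"] - s["start"]
    return dict(totals)


def low_confidence(segments: List[Dict], threshold: float = -1.0) -> List[Dict]:
    """Return segments whose avg_logprob falls below an assumed threshold."""
    return [s for s in segments if s["avg_logprob"] < threshold]


# Hypothetical sample data, shaped after the description above.
sample_segments = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 4.5,
     "text": " Hello, welcome.", "avg_logprob": -0.2},
    {"speaker": "SPEAKER_01", "start": 4.5, "end": 65.0,
     "text": " Thanks for having me.", "avg_logprob": -1.3},
]

for line in format_transcript(sample_segments):
    print(line)
```

Segments returned by low_confidence are candidates for manual review.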

Speed

On an A40 GPU, it takes about 2 minutes to transcribe and diarize a 25-minute MP3 of two people speaking English.

About

I am a maker, building 🎙️ Audiogest, a web app that uses this model. Contact me if you’d like a demo or want to know more: thomas@audiogest.app or X/Twitter: x.com/thomas_mol
