thomasmol/whisper-diarization

⚡️ Blazing fast audio transcription with speaker diarization | Whisper Large V3 Turbo & pyannote 4.0 community-1 | word & sentence level timestamps | prompt

Transcribe any audio file with speaker diarization

Uses Whisper Large V3 Turbo + Pyannote Speaker Diarization Community-1

Create transcripts with speaker labels, timestamps, and word-level timing. Uses faster-whisper 1.2.1 and pyannote.audio 4.0.4 with pyannote/speaker-diarization-community-1 under the hood.

Last update: 10 June 2026

Now uses Pyannote Speaker Diarization Community-1, updated Faster Whisper, newer dependencies, local model loading, and improved audio handling.

Usage

Input

  • file_string: str: Base64 encoded audio file.
  • file_url: str: Direct audio file URL.
  • file: Path: Audio file upload.
  • num_speakers: int: Number of speakers, between 1 and 50. Leave empty to auto-detect.
  • translate: bool: Translate speech into English.
  • language: str: Language of the spoken words as a language code such as en. Leave empty to auto-detect.
  • prompt: str: Vocabulary: provide names, acronyms, and loanwords. Use punctuation for best accuracy.

Provide exactly one of file_string, file_url, or file.
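As a rough illustration of the input rules above, the sketch below assembles an input dictionary and checks the two stated constraints (exactly one audio source, num_speakers between 1 and 50). The helper names build_input and encode_audio_bytes are hypothetical and not part of the model. Dropping empty optional fields so the model can auto-detect them is an assumption about how "leave empty" is best expressed.

```python
import base64
from typing import Any, Dict, Optional


def encode_audio_bytes(data: bytes) -> str:
    """Base64-encode raw audio bytes for the file_string input."""
    return base64.b64encode(data).decode("ascii")


def build_input(
    file_string: Optional[str] = None,
    file_url: Optional[str] = None,
    file: Optional[Any] = None,
    num_speakers: Optional[int] = None,
    translate: Optional[bool] = None,
    language: Optional[str] = None,
    prompt: Optional[str] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: build an input dict that follows the rules above."""
    sources = {"file_string": file_string, "file_url": file_url, "file": file}
    provided = {key: value for key, value in sources.items() if value is not None}
    if len(provided) != 1:
        raise ValueError("Provide exactly one of file_string, file_url, or file.")
    if num_speakers is not None and not 1 <= num_speakers <= 50:
        raise ValueError("num_speakers must be between 1 and 50.")

    payload: Dict[str, Any] = dict(provided)
    optional = {
        "num_speakers": num_speakers,
        "translate": translate,
        "language": language,
        "prompt": prompt,
    }
    # Assumption: unset options are omitted so the model falls back to
    # auto-detection (speakers, language) or its own defaults.
    payload.update({key: value for key, value in optional.items() if value is not None})
    return payload


# Example: a two-speaker English recording referenced by URL (placeholder URL).
example = build_input(
    file_url="https://example.com/meeting.mp3",
    num_speakers=2,
    language="en",
    prompt="Whisper, pyannote, diarization.",
)
print(example)
```

The resulting dictionary can then be passed as the input to whichever client you use to run the model. To send a local file inline instead, read its bytes and pass encode_audio_bytes(...) as file_string.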

Output

  • segments: List[Dict]: Transcript segments with speaker, text, start time, end time, duration, and word-level details.
      • Each segment includes avg_logprob.
      • Each word-level entry includes probability, timestamps, and a speaker label.
  • num_speakers: int: Number of speakers detected, or the value specified in the input.
  • language: str: Spoken language detected, or the value specified in the input.
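To show how the segments list might be consumed, here is a minimal sketch that turns segments into a readable, speaker-labelled transcript. The key names speaker, text, start, and end, the SPEAKER_00 label format, and the sample data are all assumptions for illustration. Check them against an actual response before relying on them.

```python
from typing import Any, Dict, List


def format_timestamp(seconds: float) -> str:
    """Render a time offset in seconds as MM:SS (fractions are truncated)."""
    total = int(seconds)
    return f"{total // 60:02d}:{total % 60:02d}"


def format_transcript(segments: List[Dict[str, Any]]) -> str:
    """Build one line per segment: [start - end] speaker: text."""
    lines = []
    for segment in segments:
        # Assumed keys: "start", "end", "speaker", "text".
        start = format_timestamp(segment["start"])
        end = format_timestamp(segment["end"])
        text = segment["text"].strip()
        lines.append(f"[{start} - {end}] {segment['speaker']}: {text}")
    return "\n".join(lines)


# Made-up sample data shaped like the output described above.
sample_segments = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 4.2, "text": " Hello there."},
    {"speaker": "SPEAKER_01", "start": 4.5, "end": 65.0, "text": " Hi, how are you?"},
]

print(format_transcript(sample_segments))
# [00:00 - 00:04] SPEAKER_00: Hello there.
# [00:04 - 01:05] SPEAKER_01: Hi, how are you?
```

The same loop is a natural place to drop or flag segments whose avg_logprob falls below a threshold of your choosing.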

Made possible by

Speed

On an L40S GPU, transcribing and diarizing a 25-minute MP3 of two people speaking English takes less than 1 minute.

About

Contact me if you’d like a demo or want to know more: