thomasmol / whisper-diarization

⚡️ Fast audio transcription | whisper v3 | speaker diarization | word level timestamps | prompt

  • Public
  • 243.2K runs
  • GitHub
  • Paper
  • License

Input

Output

Run time and cost

This model runs on Nvidia A40 GPU hardware. Predictions typically complete within 99 seconds. The predict time for this model varies significantly based on the inputs.

Readme

Transcribe any audio file with speaker diarization

Uses Whisper Large V3 + Pyannote.audio 3.1

Create transcripts with speaker labels and timestamps (diarization) easily with this model. Uses faster-whisper 0.10.0 and pyannote 3.1.1 under the hood.

Usage

Input

  • file_string: str: A Base64-encoded audio file. Provide this, file_url, or file.
  • file_url: str: A direct URL to an audio file.
  • file: Path: An uploaded audio file.
  • group_segments: bool: Group segments of the same speaker shorter than 2 seconds apart. Default is True.
  • num_speakers: int: Number of speakers. Leave empty to autodetect. Must be between 1 and 50.
  • language: str: Language of the spoken words as a language code like ‘en’. Leave empty to auto detect language.
  • prompt: str: Vocabulary: provide names, acronyms, and loanwords in a list. Use punctuation for best accuracy.
  • offset_seconds: int: Offset in seconds, used for chunked inputs. Default is 0.
  • transcript_output_format: str: Format of the transcript output: individual words with timestamps (words_only), full text of segments (segments_only), or a combination of both (both). Default is both.
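The parameters above map directly onto the input payload you send to the model. Below is a minimal sketch of building that payload in Python, using the file_string (Base64) option; the helper name build_input is hypothetical, and you would pass the resulting dict as the input to your Replicate client call for this model.

```python
import base64

def build_input(audio_path, num_speakers=None, language=None, prompt=None):
    """Build an input payload for the model, sending audio as Base64 via file_string."""
    with open(audio_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    payload = {
        "file_string": encoded,
        "group_segments": True,              # merge same-speaker segments < 2 s apart
        "transcript_output_format": "both",  # words_only | segments_only | both
    }
    if num_speakers is not None:
        payload["num_speakers"] = num_speakers  # 1-50; omit to autodetect
    if language is not None:
        payload["language"] = language          # e.g. "en"; omit to autodetect
    if prompt is not None:
        payload["prompt"] = prompt              # names, acronyms, loanwords
    return payload
```

For large files, file_url is usually the better choice, since a Base64 string inflates the request body by roughly a third.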

Output

  • segments: List[Dict]: List of segments with speaker, start, and end time. Each segment includes avg_logprob, and each word-level entry includes its probability.
  • num_speakers: int: Number of speakers (detected, unless specified in input).
  • language: str: Language of the spoken words as a language code like ‘en’ (detected, unless specified in input).
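A common next step is turning the segments list into a readable, speaker-labeled transcript. Here is a minimal sketch; the field names (speaker, start, end, text) follow the segment description above, but the exact keys are an assumption, so check a real response before relying on them.

```python
def format_transcript(output):
    """Render the model output as one '[start - end] speaker: text' line per segment."""
    lines = []
    for seg in output["segments"]:
        stamp = f"[{seg['start']:.1f}s - {seg['end']:.1f}s]"
        lines.append(f"{stamp} {seg['speaker']}: {seg['text'].strip()}")
    return "\n".join(lines)
```

With group_segments enabled, consecutive turns by the same speaker are already merged, so this tends to produce one line per speaker turn.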

Made possible by

Speed

On an A40 GPU, transcribing and diarizing a 25-minute MP3 of two people speaking English takes about 2 minutes.

About

I am a maker, building 🎙️ Audiogest, a web app that uses this model. Upload audio or video files and generate transcripts and summaries; you can also edit and export transcripts. I am also building Spectropic AI, a simple API wrapper around this model. Contact me if you’d like a demo or want to know more: spectropic@thomasmol.com or X/Twitter: x.com/thomas_mol