thomasmol / whisper-diarization

⚡️ Fast audio transcription | whisper large-v3 | speaker diarization | word & sentence level timestamps | prompt | hotwords

Transcribe any audio file with speaker diarization

Uses Whisper Large V3 + Pyannote.audio 3.3

Create transcripts with speaker labels and timestamps (diarization) easily with this model. Uses faster-whisper 1.0.3 and pyannote 3.3.1 under the hood.

Last update: 8 July 2024. Updated to the latest faster-whisper version with improved VAD: say goodbye to hallucinations! Added support for ‘hotwords’, which are used like initial_prompt in Whisper but are added to each window.

Usage

Input

  • file_string: str: A Base64-encoded audio file.
  • file_url: str: Alternatively, a direct URL to an audio file.
  • file: Path: Alternatively, an audio file.
  • group_segments: bool: Group segments from the same speaker that are less than 2 seconds apart. Default is True.
  • num_speakers: int: Number of speakers. Leave empty to autodetect. Must be between 1 and 50.
  • language: str: Language of the spoken words as a language code like ‘en’. Leave empty to autodetect the language.
  • prompt: str: Vocabulary: provide names, acronyms, and foreign words in a list. Also used as the ‘hotwords’ parameter of faster-whisper. Use punctuation for best accuracy.
  • offset_seconds: int: Offset in seconds, used for chunked inputs. Default is 0.
  • transcript_output_format: str: Format of the transcript output: individual words with timestamps (words_only), full text of segments (segments_only), or a combination of both (both). Default is both.
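As a rough illustration of the inputs listed above, the sketch below assembles an input dictionary using the documented parameter names. The helper name build_input, its argument names, and the rule that exactly one audio source is given are assumptions made for this example; they are not part of the model's own code.

```python
import base64
from typing import Any, Dict, Optional

# Output formats listed in the Input section.
ALLOWED_FORMATS = {"words_only", "segments_only", "both"}


def build_input(
    audio_bytes: Optional[bytes] = None,
    file_url: Optional[str] = None,
    num_speakers: Optional[int] = None,
    language: Optional[str] = None,
    prompt: Optional[str] = None,
    group_segments: bool = True,
    offset_seconds: int = 0,
    transcript_output_format: str = "both",
) -> Dict[str, Any]:
    """Hypothetical helper: build an input dict with the documented keys."""
    # Assumption: exactly one audio source should be supplied.
    if (audio_bytes is None) == (file_url is None):
        raise ValueError("provide exactly one of audio_bytes or file_url")
    # Documented constraint: num_speakers must be between 1 and 50.
    if num_speakers is not None and not 1 <= num_speakers <= 50:
        raise ValueError("num_speakers must be between 1 and 50")
    if transcript_output_format not in ALLOWED_FORMATS:
        raise ValueError("unknown transcript_output_format")

    payload: Dict[str, Any] = {
        "group_segments": group_segments,
        "offset_seconds": offset_seconds,
        "transcript_output_format": transcript_output_format,
    }
    if audio_bytes is not None:
        # file_string expects the audio as Base64 text.
        payload["file_string"] = base64.b64encode(audio_bytes).decode("ascii")
    else:
        payload["file_url"] = file_url

    # Optional fields are left out so the model can autodetect them.
    if num_speakers is not None:
        payload["num_speakers"] = num_speakers
    if language:
        payload["language"] = language
    if prompt:
        payload["prompt"] = prompt
    return payload
```

The resulting dictionary could then be passed as the input to whichever client you use to run the model. For chunked audio, offset_seconds would be set to the start time of each chunk.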

Output

  • segments: List[Dict]: List of segments with speaker, start time, and end time. Each segment includes avg_logprob, and each word-level entry includes probability.
  • num_speakers: int: Number of speakers (detected, unless specified in input).
  • language: str: Language of the spoken words as a language code like ‘en’ (detected, unless specified in input).
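A minimal sketch of post-processing the output follows. It turns segments into readable transcript lines and lists words with a low probability. The key names text, words, and word, as well as the 0.5 threshold and the function names, are assumptions made for illustration; check them against the actual output before relying on them.

```python
from typing import Any, Dict, List


def format_timestamp(seconds: float) -> str:
    """Render a time in seconds as HH:MM:SS, dropping fractions."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def format_transcript(segments: List[Dict[str, Any]]) -> List[str]:
    """Build one '[start - end] speaker: text' line per segment."""
    lines = []
    for seg in segments:
        # float() tolerates times given as numbers or numeric strings.
        start = format_timestamp(float(seg["start"]))
        end = format_timestamp(float(seg["end"]))
        text = str(seg.get("text", "")).strip()  # "text" key is assumed
        lines.append(f"[{start} - {end}] {seg['speaker']}: {text}")
    return lines


def low_confidence_words(
    segments: List[Dict[str, Any]], threshold: float = 0.5
) -> List[str]:
    """Collect words whose probability is below the (arbitrary) threshold."""
    flagged = []
    for seg in segments:
        for word in seg.get("words", []):  # "words" key is assumed
            if float(word["probability"]) < threshold:
                flagged.append(str(word["word"]).strip())
    return flagged
```

Flagged words are reasonable candidates for manual review, or for adding to the prompt on a second run if they are names or acronyms.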

Made possible by

Speed

On an A40 GPU, it takes about 2 minutes to transcribe and diarize a 25-minute MP3 of two people speaking English.
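For a rough planning estimate, the figure above can be extrapolated to other durations. The sketch below assumes processing time scales linearly with audio length, which is not stated in this README and may not hold for other hardware, languages, or speaker counts. The function name is hypothetical.

```python
# Reference point from the Speed section: 25 minutes of audio in about 2 minutes.
REFERENCE_AUDIO_SECONDS = 25 * 60
REFERENCE_PROCESSING_SECONDS = 2 * 60


def estimate_processing_seconds(audio_seconds: float) -> float:
    """Linear extrapolation from the single reference point (an assumption)."""
    return audio_seconds * REFERENCE_PROCESSING_SECONDS / REFERENCE_AUDIO_SECONDS


# Implied speed relative to real time for the reference run.
REAL_TIME_FACTOR = REFERENCE_AUDIO_SECONDS / REFERENCE_PROCESSING_SECONDS
```

Under this assumption the reference run is about 12.5 times faster than real time, so a one-hour recording would take a little under 5 minutes.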

About

I am a maker building 🎙️ Audiogest, a web app that uses this model. Upload audio or video files to generate transcripts and summaries, then edit and export the transcripts. I am also building Spectropic AI, a simple API wrapper around this model. Contact me if you’d like a demo or want to know more: thomas@spectropic.ai or X/Twitter: x.com/thomas_mol