thomasmol / whisper-diarization

⚡️ Fast audio transcription | whisper large-v3 | speaker diarization | word & sentence level timestamps | prompt | hotwords


Run time and cost

This model costs approximately $0.090 to run on Replicate (about 11 runs per $1), but this varies depending on your inputs. It is also open source, and you can run it on your own computer with Docker.

This model runs on Nvidia A40 GPU hardware. Predictions typically complete within 3 minutes. The predict time for this model varies significantly based on the inputs.

Readme

Transcribe any audio file with speaker diarization

Uses Whisper Large V3 + Pyannote.audio 3.3

Easily create transcripts with speaker labels and timestamps (diarization) using this model. It uses faster-whisper 1.0.3 and pyannote.audio 3.3.1 under the hood.
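For a sense of how those pieces fit together if you run it yourself, the core of the pipeline is: transcribe with faster-whisper, diarize with pyannote.audio, then label each transcript segment with the speaker whose turns overlap it most. A simplified Python sketch under those assumptions, not the repo's exact code; the checkpoint name pyannote/speaker-diarization-3.1, the Hugging Face token, and the file path are placeholders:

```python
from faster_whisper import WhisperModel
from pyannote.audio import Pipeline

AUDIO = "audio.mp3"  # placeholder path

# 1. Transcribe with faster-whisper (Whisper large-v3, word timestamps on).
model = WhisperModel("large-v3", device="cuda", compute_type="float16")
segments, info = model.transcribe(AUDIO, word_timestamps=True)
segments = list(segments)  # the generator is lazy; materialize it

# 2. Diarize with pyannote.audio (gated model; needs a Hugging Face token).
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token="hf_..."  # placeholder token
)
diarization = pipeline(AUDIO)

# 3. Label each transcript segment with the speaker whose turns overlap it most.
def best_speaker(start: float, end: float) -> str:
    overlap: dict[str, float] = {}
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        shared = min(end, turn.end) - max(start, turn.start)
        if shared > 0:
            overlap[speaker] = overlap.get(speaker, 0.0) + shared
    return max(overlap, key=overlap.get) if overlap else "UNKNOWN"

for seg in segments:
    print(f"[{seg.start:7.2f} - {seg.end:7.2f}] {best_speaker(seg.start, seg.end)}: {seg.text.strip()}")
```

The actual model adds more on top of this: VAD filtering, word-level alignment, segment grouping, and offset handling, as described in the inputs below.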

Last update: 8 July 2024. Updated to the latest faster-whisper version with improved VAD: say goodbye to hallucinations! Added support for ‘hotwords’, which work like initial_prompt in Whisper but are added to each window.

Usage

Input

  • file_string: str: Provide the audio as a Base64-encoded string, or
  • file_url: str: provide a direct URL to the audio file, or
  • file: Path: provide the audio file itself. Exactly one of these three is required.
  • group_segments: bool: Group segments of the same speaker shorter than 2 seconds apart. Default is True.
  • num_speakers: int: Number of speakers. Leave empty to autodetect. Must be between 1 and 50.
  • language: str: Language of the spoken words as a language code like ‘en’. Leave empty to auto detect language.
  • prompt: str: Vocabulary: provide names, acronyms, and foreign words in a list. Also used as the ‘hotwords’ parameter of faster-whisper. Use punctuation for best accuracy.
  • offset_seconds: int: Offset in seconds, used for chunked inputs. Default is 0.
  • transcript_output_format: str: Specify the format of the transcript output: individual words with timestamps, full text of segments, or a combination of both. Options are words_only, segments_only, and both. Default is both. See the example call below.
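
A minimal sketch of a call with these inputs via the Replicate Python client; the URL is a placeholder, and in production you may want to pin a specific model version:

```python
import replicate

# Provide exactly one of file, file_url, or file_string.
# For file_string, Base64-encode the audio first, e.g.:
#   import base64
#   b64 = base64.b64encode(open("audio.mp3", "rb").read()).decode()
output = replicate.run(
    "thomasmol/whisper-diarization",
    input={
        "file_url": "https://example.com/interview.mp3",  # placeholder URL
        "num_speakers": 2,         # omit to autodetect (1-50)
        "language": "en",          # omit to autodetect
        "prompt": "Audiogest, Spectropic.",  # vocabulary / hotwords, punctuated
        "group_segments": True,
        "transcript_output_format": "both",
    },
)
```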

Output

  • segments: List[Dict]: List of segments with speaker, start and end time. Includes an avg_logprob for each segment and a probability for each word-level entry.
  • num_speakers: int: Number of speakers (detected, unless specified in input).
  • language: str: Language of the spoken words as a language code like ‘en’ (detected, unless specified in input).
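
The output is a plain dictionary, so a speaker-labelled transcript falls out directly. A sketch, assuming each segment also carries a text field alongside speaker, start, and end (continuing from the output value above):

```python
# Print a simple speaker-labelled transcript from the model output.
for seg in output["segments"]:
    print(f'[{seg["start"]} - {seg["end"]}] {seg["speaker"]}: {seg["text"]}')

print("Speakers:", output["num_speakers"])
print("Language:", output["language"])
```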

Made possible by

faster-whisper and pyannote.audio.

Speed

On an A40 GPU, transcribing and diarizing a 25-minute MP3 of two people speaking English takes about 2 minutes.

About

I am a maker, building 🎙️ Audiogest, a web app that uses this model: upload audio or video files, generate transcripts and summaries, and edit and export transcripts. I am also building Spectropic AI, a simple API wrapper around this model. Contact me if you’d like a demo or want to know more: thomas@spectropic.ai or X/Twitter: x.com/thomas_mol