nicknaskida / whisper-diarization

⚡️ Insanely Fast audio transcription | whisper large-v3 | speaker diarization | word & sentence level timestamps | prompt | hotwords. Fork of thomasmol/whisper-diarization. Added batched whisper, 3x-4x speedup 🚀


Input

Run this model in Node.js with one line of code:

npx create-replicate --model=nicknaskida/whisper-diarization
or set up a project from scratch
npm install replicate
Set the REPLICATE_API_TOKEN environment variable:
export REPLICATE_API_TOKEN=<paste-your-token-here>

Find your API token in your account settings.

Import and set up the client:
import Replicate from "replicate";

const replicate = new Replicate({
  auth: process.env.REPLICATE_API_TOKEN,
});

Run nicknaskida/whisper-diarization using Replicate’s API. Check out the model's schema for an overview of inputs and outputs.

const output = await replicate.run(
  "nicknaskida/whisper-diarization:c643440e783b6d1dcaef96ba97f2034ac61f02df8a3f2ae0481164ec38e8ac0d",
  {
    input: {
      // An audio source is required: one of file_url, file, or file_string
      file_url: "https://example.com/audio.mp3", // placeholder URL
      translate: false,
      batch_size: 64,
      num_speakers: 2,
      group_segments: true,
      offset_seconds: 0,
      transcript_output_format: "both"
    }
  }
);

console.log(output);
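
Since transcripts of long recordings can be large, you may want to persist the result rather than only logging it. A minimal sketch that writes the JSON to disk (the filename is arbitrary):

import fs from "node:fs"; // at the top of your script

fs.writeFileSync("transcript.json", JSON.stringify(output, null, 2));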

To learn more, take a look at the guide on getting started with Node.js.


Run time and cost

This model runs on Nvidia L40S GPU hardware. We don't yet have enough runs of this model to provide performance information.

Readme

Whisper Diarization

Audio transcription + speaker diarization pipeline.

⚡️ Super Fast Transcription and Diarization: a 2-hour audio file in about 3 minutes

Models used

  • Whisper Large v3 (CTranslate2 version, faster-whisper==1.0.3)
  • Pyannote audio 3.3.1

Usage

  • Used at Audiogest
  • Or try at Replicate
  • Or deploy yourself at Replicate (Make sure to add your own HuggingFace API key and accept the terms of use of the pyannote models used)

Input

  • file_string: str: A Base64-encoded audio file,
  • file_url: str: or a direct audio file URL,
  • file: Path: or an audio file. One of these three is required.
  • group_segments: bool: Group consecutive segments from the same speaker that are less than 2 seconds apart. Default is True.
  • num_speakers: int: Number of speakers. Leave empty to autodetect. Must be between 1 and 50.
  • translate: bool: Translate the speech into English.
  • language: str: Language of the spoken words as a language code like ‘en’. Leave empty to auto-detect the language.
  • prompt: str: Vocabulary: provide names, acronyms, and loanwords in a list. Use punctuation for best accuracy. Also used as the ‘hotwords’ parameter during transcription.
  • offset_seconds: int: Offset in seconds, used for chunked inputs. Default is 0.
  • transcript_output_format: str: Specify the format of the transcript output: individual words with timestamps, full text of segments, or a combination of both. Options are words_only, segments_only, and both. Default is both. See the example input after this list.
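
Putting these parameters together, a fuller input object might look like the sketch below. The URL is a placeholder, and the num_speakers, language, and prompt values are illustrative, not defaults:

const input = {
  // Provide one audio source: file_url, file, or file_string
  file_url: "https://example.com/interview.mp3", // placeholder URL
  num_speakers: 2,                     // omit to auto-detect
  language: "en",                      // omit to auto-detect
  prompt: "Acme Corp, Jane Doe, RAG.", // illustrative vocabulary, also used as hotwords
  translate: false,
  group_segments: true,
  offset_seconds: 0,
  transcript_output_format: "both",
};

Pass this object as the input field of replicate.run, as shown in the Node.js example above.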

Output

  • segments: List[Dict]: List of segments with speaker, start time, and end time. Each segment includes avg_logprob, and each word-level entry includes a probability. See the iteration sketch after this list.
  • num_speakers: int: Number of speakers (detected, unless specified in input).
  • language: str: Language of the spoken words as a language code like ‘en’ (detected, unless specified in input).
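
As a sketch of consuming the result, the segments list can be iterated as below. Field names follow the descriptions above; the presence of a text field per segment is an assumption, so check the model's schema for the exact shape:

for (const segment of output.segments) {
  // Speaker label plus start/end timestamps for each grouped segment
  console.log(`${segment.speaker} [${segment.start}s - ${segment.end}s]: ${segment.text}`);
}
console.log(`Speakers: ${output.num_speakers}, language: ${output.language}`);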

Thanks to