nicknaskida / whisper-diarization

⚡️ Insanely Fast audio transcription | whisper large-v3 | speaker diarization | word & sentence level timestamps | prompt | hotwords. Fork of thomasmol/whisper-diarization. Added batched whisper, 3x-4x speedup 🚀


Input

Run this model in Node.js with one line of code:

npx create-replicate --model=nicknaskida/whisper-diarization
or set up a project from scratch
npm install replicate
Set the REPLICATE_API_TOKEN environment variable:
export REPLICATE_API_TOKEN=<paste-your-token-here>

Find your API token in your account settings.

Import and set up the client:
import Replicate from "replicate";

const replicate = new Replicate({
  auth: process.env.REPLICATE_API_TOKEN,
});

Run nicknaskida/whisper-diarization using Replicate’s API. Check out the model's schema for an overview of inputs and outputs.

const output = await replicate.run(
  "nicknaskida/whisper-diarization:c643440e783b6d1dcaef96ba97f2034ac61f02df8a3f2ae0481164ec38e8ac0d",
  {
    input: {
      // An audio source is required: one of file_url, file, or file_string
      file_url: "https://example.com/audio.mp3", // placeholder URL
      translate: false,
      batch_size: 64,
      num_speakers: 2,
      group_segments: true,
      offset_seconds: 0,
      transcript_output_format: "both"
    }
  }
);

console.log(output);
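
Since transcripts of long recordings can be large, you may want to persist the result rather than only logging it. A minimal sketch that writes the JSON to disk (the filename is arbitrary):

import fs from "node:fs"; // at the top of your script

fs.writeFileSync("transcript.json", JSON.stringify(output, null, 2));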

To learn more, take a look at the guide on getting started with Node.js.


Run time and cost

This model runs on Nvidia L40S GPU hardware. We don't yet have enough runs of this model to provide performance information.

Readme

Whisper Diarization

Audio transcription + speaker diarization pipeline.

⚡️ Super Fast Transcription and Diarization: a 2-hour audio file in about 3 minutes

Models used

  • Whisper Large v3 (CTranslate2 version, faster-whisper==1.0.3)
  • Pyannote audio 3.3.1

Usage

  • Used at Audiogest
  • Or try at Replicate
  • Or deploy yourself at Replicate (Make sure to add your own HuggingFace API key and accept the terms of use of the pyannote models used)

Input

  • file_string: str: A Base64-encoded audio file,
  • file_url: str: or a direct audio file URL,
  • file: Path: or an audio file. One of these three is required.
  • group_segments: bool: Group consecutive segments from the same speaker that are less than 2 seconds apart. Default is True.
  • num_speakers: int: Number of speakers. Leave empty to autodetect. Must be between 1 and 50.
  • translate: bool: Translate the speech into English.
  • language: str: Language of the spoken words as a language code like ‘en’. Leave empty to auto-detect the language.
  • prompt: str: Vocabulary: provide names, acronyms, and loanwords in a list. Use punctuation for best accuracy. Also used as the ‘hotwords’ parameter during transcription.
  • offset_seconds: int: Offset in seconds, used for chunked inputs. Default is 0.
  • transcript_output_format: str: Specify the format of the transcript output: individual words with timestamps, full text of segments, or a combination of both. Options are words_only, segments_only, and both. Default is both. See the example input after this list.
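
Putting these parameters together, a fuller input object might look like the sketch below. The URL is a placeholder, and the num_speakers, language, and prompt values are illustrative, not defaults:

const input = {
  // Provide one audio source: file_url, file, or file_string
  file_url: "https://example.com/interview.mp3", // placeholder URL
  num_speakers: 2,                     // omit to auto-detect
  language: "en",                      // omit to auto-detect
  prompt: "Acme Corp, Jane Doe, RAG.", // illustrative vocabulary, also used as hotwords
  translate: false,
  group_segments: true,
  offset_seconds: 0,
  transcript_output_format: "both",
};

Pass this object as the input field of replicate.run, as shown in the Node.js example above.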

Output

  • segments: List[Dict]: List of segments with speaker, start time, and end time. Each segment includes avg_logprob, and each word-level entry includes a probability. See the iteration sketch after this list.
  • num_speakers: int: Number of speakers (detected, unless specified in input).
  • language: str: Language of the spoken words as a language code like ‘en’ (detected, unless specified in input).
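
As a sketch of consuming the result, the segments list can be iterated as below. Field names follow the descriptions above; the presence of a text field per segment is an assumption, so check the model's schema for the exact shape:

for (const segment of output.segments) {
  // Speaker label plus start/end timestamps for each grouped segment
  console.log(`${segment.speaker} [${segment.start}s - ${segment.end}s]: ${segment.text}`);
}
console.log(`Speakers: ${output.num_speakers}, language: ${output.language}`);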

Thanks to