daanelson/whisperx | Run with an API on Replicate

Input

audio (file, required)
Audio file.

batch_size (integer, default: 32)
Parallelization of input audio transcription.

align_output (boolean, default: false)
Use if you need word-level timing and not just batched transcription. Currently works only for English.

only_text (boolean, default: false)
Set if you only want to return text; otherwise, segment metadata will be returned as well.

debug (boolean, default: false)
Print out memory usage information.
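
As an illustration of how these parameters fit together, the following Python sketch assembles an input payload using the defaults listed above. The helper name build_input and its validation rule (batch_size must be a positive integer) are assumptions made for this example; they are not part of Replicate's client library or of the model's schema.

```python
from typing import Any, Dict


def build_input(
    audio: str,
    batch_size: int = 32,
    align_output: bool = False,
    only_text: bool = False,
    debug: bool = False,
) -> Dict[str, Any]:
    """Hypothetical helper: assemble the model's input dictionary."""
    if not audio:
        raise ValueError("audio is required")
    # Assumed sanity check; the schema only states that batch_size is an integer.
    if isinstance(batch_size, bool) or not isinstance(batch_size, int) or batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    return {
        "audio": audio,
        "batch_size": batch_size,
        "align_output": align_output,
        "only_text": only_text,
        "debug": debug,
    }


# Placeholder URL, used only for illustration.
payload = build_input("https://example.com/speech.wav", align_output=True)
print(payload)
```

A dictionary built this way can be passed as the input argument of replicate.run.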

Run this model in Node.js with one line of code:

npx create-replicate --model=daanelson/whisperx

or set up a project from scratch

Install Replicate’s Node.js client library:

npm install replicate

Set the REPLICATE_API_TOKEN environment variable:

export REPLICATE_API_TOKEN=<paste-your-token-here>

Find your API token in your account settings.

Import and set up the client:

import Replicate from "replicate";

const replicate = new Replicate({
  auth: process.env.REPLICATE_API_TOKEN,
});

Run daanelson/whisperx using Replicate’s API. Check out the model's schema for an overview of inputs and outputs.

const output = await replicate.run(
  "daanelson/whisperx:9aa6ecadd30610b81119fc1b6807302fd18ca6cbb39b3216f430dcf23618cedd",
  {
    input: {
      audio: "https://replicate.delivery/pbxt/J5r78wKSymorzW9idAbbbJ7iXQl9GddZTwfdX5OlLJW2hLR2/OSR_uk_000_0050_8k.wav",
      debug: false,
      only_text: false,
      batch_size: 32,
      align_output: false
    }
  }
);

console.log(output);

To learn more, take a look at the guide on getting started with Node.js.

Install Replicate’s Python client library:

pip install replicate

Set the REPLICATE_API_TOKEN environment variable:

export REPLICATE_API_TOKEN=<paste-your-token-here>

Find your API token in your account settings.

Import the client:

import replicate

Run daanelson/whisperx using Replicate’s API. Check out the model's schema for an overview of inputs and outputs.

output = replicate.run(
    "daanelson/whisperx:9aa6ecadd30610b81119fc1b6807302fd18ca6cbb39b3216f430dcf23618cedd",
    input={
        "audio": "https://replicate.delivery/pbxt/J5r78wKSymorzW9idAbbbJ7iXQl9GddZTwfdX5OlLJW2hLR2/OSR_uk_000_0050_8k.wav",
        "debug": False,
        "only_text": False,
        "batch_size": 32,
        "align_output": False
    }
)

print(output)

To learn more, take a look at the guide on getting started with Python.

Set the REPLICATE_API_TOKEN environment variable:

export REPLICATE_API_TOKEN=<paste-your-token-here>

Find your API token in your account settings.

Run daanelson/whisperx using Replicate’s API. Check out the model's schema for an overview of inputs and outputs.

curl -s -X POST \
  -H "Authorization: Bearer $REPLICATE_API_TOKEN" \
  -H "Content-Type: application/json" \
  -H "Prefer: wait" \
  -d $'{
    "version": "daanelson/whisperx:9aa6ecadd30610b81119fc1b6807302fd18ca6cbb39b3216f430dcf23618cedd",
    "input": {
      "audio": "https://replicate.delivery/pbxt/J5r78wKSymorzW9idAbbbJ7iXQl9GddZTwfdX5OlLJW2hLR2/OSR_uk_000_0050_8k.wav",
      "debug": false,
      "only_text": false,
      "batch_size": 32,
      "align_output": false
    }
  }' \
  https://api.replicate.com/v1/predictions

To learn more, take a look at Replicate’s HTTP API reference docs.

Output

[
  {
    "end": 30.772,
    "text": " The little tales they tell are false. The door was barred, locked and bolted as well. Ripe pears are fit for a queen's table. A big wet stain was on the round carpet. The kite dipped and swayed but stayed aloft. The pleasant hours fly by much too soon. The room was crowded with a mild wob.",
    "start": 2.557
  },
  {
    "end": 48.558,
    "text": " The room was crowded with a wild mob. This strong arm shall shield your honour. She blushed when he gave her a white orchid. The beetle droned in the hot June sun.",
    "start": 32.999
  }
]
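
One common use of segment output like this is subtitle generation. The sketch below converts a list of segments into SubRip (SRT) text. It assumes each segment is a dictionary with start and end in seconds plus a text field, as in the sample above. The function names format_timestamp and segments_to_srt are hypothetical and do not come from WhisperX or Replicate.

```python
from typing import Dict, List, Union

Segment = Dict[str, Union[float, str]]


def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: List[Segment]) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        start = format_timestamp(float(seg["start"]))
        end = format_timestamp(float(seg["end"]))
        # Segment text in the sample starts with a space, so strip it.
        cues.append(f"{index}\n{start} --> {end}\n{str(seg['text']).strip()}")
    return "\n\n".join(cues) + "\n"


example = [
    {"start": 32.999, "end": 48.558, "text": " The room was crowded with a wild mob."},
]
print(segments_to_srt(example))
```

With align_output enabled the model returns additional word-level detail; this sketch ignores any fields other than start, end and text.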

Generated in 2.7 seconds

Run time and cost

This model costs approximately $0.032 to run on Replicate, or 31 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.
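
The relationship between the per-run price and the runs-per-dollar figure can be checked with a short calculation. The sketch below uses the approximate $0.032 figure quoted above. The function names are hypothetical, and actual cost varies with the inputs, so this is only a rough budgeting aid.

```python
import math

# Approximate per-run cost quoted on this page; real cost depends on the inputs.
COST_PER_RUN_USD = 0.032


def runs_per_dollar(cost_per_run: float = COST_PER_RUN_USD) -> int:
    """Whole number of runs that fit into one dollar."""
    return math.floor(1 / cost_per_run)


def estimated_cost(runs: int, cost_per_run: float = COST_PER_RUN_USD) -> float:
    """Rough total cost in USD for a given number of runs."""
    return runs * cost_per_run


print(runs_per_dollar())
print(round(estimated_cost(100), 2))
```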

This model runs on Nvidia T4 GPU hardware. Predictions typically complete within 143 seconds. The predict time for this model varies significantly based on the inputs.

Readme

Model Information

WhisperX provides fast automatic speech recognition (70x realtime with large-v2) with word-level timestamps and speaker diarization.
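
To make the "70x realtime" figure concrete, a realtime factor is the audio duration divided by the processing time. The sketch below applies that definition. The function names are hypothetical, and the 70x figure is the one quoted above for large-v2, so it may not match the speed observed on any particular hardware or input.

```python
def realtime_factor(audio_seconds: float, processing_seconds: float) -> float:
    """How many seconds of audio are processed per second of compute."""
    return audio_seconds / processing_seconds


def processing_time(audio_seconds: float, factor: float) -> float:
    """Estimated processing time in seconds at a given realtime factor."""
    return audio_seconds / factor


# One hour of audio at the quoted 70x realtime speed.
print(round(processing_time(3600, 70), 1))
```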

Whisper is an ASR model developed by OpenAI, trained on a large dataset of diverse audio. While it produces highly accurate transcriptions, the corresponding timestamps are at the utterance level, not per word, and can be inaccurate by several seconds. OpenAI’s Whisper does not natively support batching, but WhisperX does.

This implementation of WhisperX supports transcription in all languages that Whisper supports, and alignment of English audio. Although WhisperX itself supports alignment for multiple languages, this implementation currently aligns only English, to keep transcription fast.

For more information about WhisperX, including implementation details, see the WhisperX GitHub repo.

Citation

@misc{bain2023whisperx,
      title={WhisperX: Time-Accurate Speech Transcription of Long-Form Audio}, 
      author={Max Bain and Jaesung Huh and Tengda Han and Andrew Zisserman},
      year={2023},
      eprint={2303.00747},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}