carnifexer/whisperx | Run with an API on Replicate

Input

audio (file, required): Audio file.

batch_size (integer, default 32): Parallelization of input audio transcription.

align_output (boolean, default false): Use if you need word-level timing and not just batched transcription.

only_text (boolean, default false): Set if you only want to return text; otherwise, segment metadata is returned as well.

debug (boolean, default false): Print out memory usage information.
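
As a rough illustration of how the inputs above fit together, the sketch below assembles an input dictionary using the documented parameter names and defaults. The helper name build_input, the placeholder URL, and the check that batch_size is a positive integer are assumptions made for this example; they are not part of the model or of Replicate's client.

```python
from typing import Any, Dict


def build_input(
    audio: str,
    batch_size: int = 32,
    align_output: bool = False,
    only_text: bool = False,
    debug: bool = False,
) -> Dict[str, Any]:
    """Assemble an input dict with the parameter names and defaults listed above.

    Hypothetical helper: the positivity check on batch_size is an assumption,
    not a documented constraint of the model.
    """
    if isinstance(batch_size, bool) or not isinstance(batch_size, int) or batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    return {
        "audio": audio,
        "batch_size": batch_size,
        "align_output": align_output,
        "only_text": only_text,
        "debug": debug,
    }


# Placeholder URL for illustration only.
payload = build_input("https://example.com/audio.mp3", batch_size=16, align_output=True)
print(payload)
```

The resulting dictionary can be passed as the input argument of a client call; parameters left out fall back to the defaults listed above.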

Run this model in Node.js with one line of code:

npx create-replicate --model=carnifexer/whisperx

or set up a project from scratch

Install Replicate’s Node.js client library:

npm install replicate

Set the REPLICATE_API_TOKEN environment variable:

export REPLICATE_API_TOKEN=<paste-your-token-here>

Find your API token in your account settings.

Import and set up the client:

import Replicate from "replicate";

const replicate = new Replicate({
  auth: process.env.REPLICATE_API_TOKEN,
});

Run carnifexer/whisperx using Replicate’s API. Check out the model's schema for an overview of inputs and outputs.

const output = await replicate.run(
  "carnifexer/whisperx:1e0315854645f245d04ff09f5442778e97b8588243c7fe40c644806bde297e04",
  {
    input: {
      audio: "https://replicate.delivery/pbxt/JNkWNoivCCKEgLrBXKzqrMFSmJ5XT7hEvFO4w2led0avEURe/audio.mp3",
      debug: true,
      only_text: false,
      batch_size: 16,
      align_output: true
    }
  }
);

console.log(output);

To learn more, take a look at the guide on getting started with Node.js.

Install Replicate’s Python client library:

pip install replicate

Set the REPLICATE_API_TOKEN environment variable:

export REPLICATE_API_TOKEN=<paste-your-token-here>

Find your API token in your account settings.

Import the client:

import replicate

Run carnifexer/whisperx using Replicate’s API. Check out the model's schema for an overview of inputs and outputs.

output = replicate.run(
    "carnifexer/whisperx:1e0315854645f245d04ff09f5442778e97b8588243c7fe40c644806bde297e04",
    input={
        "audio": "https://replicate.delivery/pbxt/JNkWNoivCCKEgLrBXKzqrMFSmJ5XT7hEvFO4w2led0avEURe/audio.mp3",
        "debug": True,
        "only_text": False,
        "batch_size": 16,
        "align_output": True
    }
)
print(output)

To learn more, take a look at the guide on getting started with Python.

Set the REPLICATE_API_TOKEN environment variable:

export REPLICATE_API_TOKEN=<paste-your-token-here>

Find your API token in your account settings.

Run carnifexer/whisperx using Replicate’s API. Check out the model's schema for an overview of inputs and outputs.

curl -s -X POST \
  -H "Authorization: Bearer $REPLICATE_API_TOKEN" \
  -H "Content-Type: application/json" \
  -H "Prefer: wait" \
  -d $'{
    "version": "carnifexer/whisperx:1e0315854645f245d04ff09f5442778e97b8588243c7fe40c644806bde297e04",
    "input": {
      "audio": "https://replicate.delivery/pbxt/JNkWNoivCCKEgLrBXKzqrMFSmJ5XT7hEvFO4w2led0avEURe/audio.mp3",
      "debug": true,
      "only_text": false,
      "batch_size": 16,
      "align_output": true
    }
  }' \
  https://api.replicate.com/v1/predictions

To learn more, take a look at Replicate’s HTTP API reference docs.

Output

[{"start": 0.028, "end": 1.289, "text": " It's the imagination.", "words": [{"word": "It's", "start": 0.028, "end": 0.128, "score": 0.218}, {"word": "the", "start": 0.148, "end": 0.248, "score": 0.683}, {"word": "imagination.", "start": 0.268, "end": 0.929, "score": 0.873}]}, {"start": 1.289, "end": 10.116, "text": "If you examine any large-scale human cooperation, you always find fiction as its basis.", "words": [{"word": "If", "start": 1.289, "end": 1.369, "score": 0.84}, {"word": "you", "start": 1.389, "end": 1.529, "score": 0.83}, {"word": "examine", "start": 1.589, "end": 2.07, "score": 0.806}, {"word": "any", "start": 2.35, "end": 2.57, "score": 0.77}, {"word": "large-scale", "start": 2.87, "end": 3.791, "score": 0.864}, {"word": "human", "start": 3.871, "end": 4.151, "score": 0.94}, {"word": "cooperation,", "start": 4.191, "end": 4.872, "score": 0.785}, {"word": "you", "start": 5.212, "end": 5.392, "score": 0.754}, {"word": "always", "start": 5.553, "end": 5.953, "score": 0.678}, {"word": "find", "start": 6.193, "end": 6.573, "score": 0.844}, {"word": "fiction", "start": 6.794, "end": 7.294, "score": 0.837}, {"word": "as", "start": 8.255, "end": 8.335, "score": 0.94}, {"word": "its", "start": 8.395, "end": 8.515, "score": 0.724}, {"word": "basis.", "start": 8.615, "end": 9.055, "score": 0.916}]}, {"start": 10.116, "end": 14.72, "text": "It's a fictional story that holds lots of strangers together.", "words": [{"word": "It's", "start": 10.116, "end": 10.216, "score": 0.88}, {"word": "a", "start": 10.256, "end": 10.276, "score": 0.979}, {"word": "fictional", "start": 10.356, "end": 10.877, "score": 0.92}, {"word": "story", "start": 10.957, "end": 11.417, "score": 0.82}, {"word": "that", "start": 11.577, "end": 11.718, "score": 0.916}, {"word": "holds", "start": 11.818, "end": 12.178, "score": 0.804}, {"word": "lots", "start": 12.558, "end": 12.758, "score": 0.876}, {"word": "of", "start": 12.798, "end": 12.858, "score": 0.769}, {"word": "strangers", "start": 13.079, "end": 13.779, "score": 0.845}, {"word": "together.", "start": 14.28, "end": 14.72, "score": 0.881}]}]
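
The example output is a list of segments, each carrying a list of words with start, end, and score fields. The prediction record on this page returns that list as a JSON-encoded string, so it may need decoding first. The sketch below, which uses only the first segment of the example as sample data, flattens the words and flags those whose score falls under a cutoff. The function name and the 0.5 threshold are arbitrary choices for illustration, not something the model defines.

```python
import json

# First segment of the example output above, kept as a JSON-encoded string.
raw_output = """[{"start": 0.028, "end": 1.289, "text": " It's the imagination.", "words": [{"word": "It's", "start": 0.028, "end": 0.128, "score": 0.218}, {"word": "the", "start": 0.148, "end": 0.248, "score": 0.683}, {"word": "imagination.", "start": 0.268, "end": 0.929, "score": 0.873}]}]"""


def low_confidence_words(output, threshold=0.5):
    """Return (word, start, end, score) tuples for words scoring below threshold.

    Accepts either the JSON-encoded string or an already decoded list.
    The threshold is a hypothetical cutoff chosen for this example.
    """
    segments = json.loads(output) if isinstance(output, str) else output
    flagged = []
    for segment in segments:
        # Segments without a "words" list (no alignment) are skipped.
        for word in segment.get("words", []):
            if "score" in word and word["score"] < threshold:
                flagged.append((word["word"], word["start"], word["end"], word["score"]))
    return flagged


flagged = low_confidence_words(raw_output)
print(flagged)
```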

{
  "completed_at": "2023-08-20T08:19:10.456869Z",
  "created_at": "2023-08-20T07:58:54.485648Z",
  "data_removed": false,
  "error": null,
  "id": "yjhkayrbi3pbakkwjn4dwvzrqa",
  "input": {
    "audio": "https://replicate.delivery/pbxt/JNkWNoivCCKEgLrBXKzqrMFSmJ5XT7hEvFO4w2led0avEURe/audio.mp3",
    "debug": true,
    "only_text": false,
    "batch_size": 16,
    "align_output": true
  },
  "logs": "max gpu memory allocated over runtime: 0.77 GB",
  "metrics": {
    "predict_time": 1.568893,
    "total_time": 1215.971221
  },
  "output": "[{\"start\": 0.028, \"end\": 1.289, \"text\": \" It's the imagination.\", \"words\": [{\"word\": \"It's\", \"start\": 0.028, \"end\": 0.128, \"score\": 0.218}, {\"word\": \"the\", \"start\": 0.148, \"end\": 0.248, \"score\": 0.683}, {\"word\": \"imagination.\", \"start\": 0.268, \"end\": 0.929, \"score\": 0.873}]}, {\"start\": 1.289, \"end\": 10.116, \"text\": \"If you examine any large-scale human cooperation, you always find fiction as its basis.\", \"words\": [{\"word\": \"If\", \"start\": 1.289, \"end\": 1.369, \"score\": 0.84}, {\"word\": \"you\", \"start\": 1.389, \"end\": 1.529, \"score\": 0.83}, {\"word\": \"examine\", \"start\": 1.589, \"end\": 2.07, \"score\": 0.806}, {\"word\": \"any\", \"start\": 2.35, \"end\": 2.57, \"score\": 0.77}, {\"word\": \"large-scale\", \"start\": 2.87, \"end\": 3.791, \"score\": 0.864}, {\"word\": \"human\", \"start\": 3.871, \"end\": 4.151, \"score\": 0.94}, {\"word\": \"cooperation,\", \"start\": 4.191, \"end\": 4.872, \"score\": 0.785}, {\"word\": \"you\", \"start\": 5.212, \"end\": 5.392, \"score\": 0.754}, {\"word\": \"always\", \"start\": 5.553, \"end\": 5.953, \"score\": 0.678}, {\"word\": \"find\", \"start\": 6.193, \"end\": 6.573, \"score\": 0.844}, {\"word\": \"fiction\", \"start\": 6.794, \"end\": 7.294, \"score\": 0.837}, {\"word\": \"as\", \"start\": 8.255, \"end\": 8.335, \"score\": 0.94}, {\"word\": \"its\", \"start\": 8.395, \"end\": 8.515, \"score\": 0.724}, {\"word\": \"basis.\", \"start\": 8.615, \"end\": 9.055, \"score\": 0.916}]}, {\"start\": 10.116, \"end\": 14.72, \"text\": \"It's a fictional story that holds lots of strangers together.\", \"words\": [{\"word\": \"It's\", \"start\": 10.116, \"end\": 10.216, \"score\": 0.88}, {\"word\": \"a\", \"start\": 10.256, \"end\": 10.276, \"score\": 0.979}, {\"word\": \"fictional\", \"start\": 10.356, \"end\": 10.877, \"score\": 0.92}, {\"word\": \"story\", \"start\": 10.957, \"end\": 11.417, \"score\": 0.82}, {\"word\": \"that\", \"start\": 11.577, \"end\": 11.718, \"score\": 0.916}, {\"word\": \"holds\", \"start\": 11.818, \"end\": 12.178, \"score\": 0.804}, {\"word\": \"lots\", \"start\": 12.558, \"end\": 12.758, \"score\": 0.876}, {\"word\": \"of\", \"start\": 12.798, \"end\": 12.858, \"score\": 0.769}, {\"word\": \"strangers\", \"start\": 13.079, \"end\": 13.779, \"score\": 0.845}, {\"word\": \"together.\", \"start\": 14.28, \"end\": 14.72, \"score\": 0.881}]}]",
  "started_at": "2023-08-20T08:19:08.887976Z",
  "status": "succeeded",
  "urls": {
    "get": "https://api.replicate.com/v1/predictions/yjhkayrbi3pbakkwjn4dwvzrqa",
    "cancel": "https://api.replicate.com/v1/predictions/yjhkayrbi3pbakkwjn4dwvzrqa/cancel"
  },
  "version": "1e0315854645f245d04ff09f5442778e97b8588243c7fe40c644806bde297e04"
}
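
The two metrics in this record differ by a wide margin, and the timestamps explain the gap. The sketch below recomputes both figures from created_at, started_at, and completed_at as they appear in the record. The record itself does not say why the prediction waited before starting, so the wait is reported without attributing a cause; the helper name parse_ts is an assumption of this example.

```python
from datetime import datetime


def parse_ts(ts: str) -> datetime:
    """Parse an ISO 8601 timestamp with a trailing Z as a UTC-aware datetime."""
    # Older Python versions do not accept "Z" in fromisoformat, so swap in an offset.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


created = parse_ts("2023-08-20T07:58:54.485648Z")
started = parse_ts("2023-08-20T08:19:08.887976Z")
completed = parse_ts("2023-08-20T08:19:10.456869Z")

wait_seconds = (started - created).total_seconds()       # time before the run began
predict_seconds = (completed - started).total_seconds()  # matches metrics.predict_time
total_seconds = (completed - created).total_seconds()    # matches metrics.total_time

print(f"waited {wait_seconds:.3f}s, predicted {predict_seconds:.3f}s, total {total_seconds:.3f}s")
```

For this record nearly all of total_time is time spent before the prediction started; the transcription itself took about 1.6 seconds.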

Generated in 1.6 seconds

Run time and cost

This model costs approximately $0.0019 to run on Replicate, or 526 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.
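
The runs-per-dollar figure follows directly from the per-run estimate. A small sketch of that arithmetic is shown here, using the approximate $0.0019 figure quoted above; the function name is hypothetical, and actual cost varies with the inputs, as noted.

```python
COST_PER_RUN_USD = 0.0019  # approximate figure quoted above

# Whole runs that fit in one dollar at that rate.
runs_per_dollar = int(1 / COST_PER_RUN_USD)  # 526


def estimated_cost(runs: int) -> float:
    """Rough cost estimate in USD for a number of runs at the quoted rate."""
    return runs * COST_PER_RUN_USD


print(runs_per_dollar)
print(round(estimated_cost(1000), 2))
```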

This model runs on Nvidia T4 GPU hardware. Predictions typically complete within 9 seconds. The predict time for this model varies significantly based on the inputs.

Readme

Whisper is an ASR model developed by OpenAI, trained on a large dataset of diverse audio. While it produces highly accurate transcriptions, the corresponding timestamps are at the utterance level, not per word, and can be inaccurate by several seconds. OpenAI’s Whisper does not natively support batching, but WhisperX does.
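
One thing per-word timing makes possible is cutting captions on word boundaries rather than on whole utterances. The sketch below turns a list of aligned words, shaped like the words entries in the example output on this page, into SRT-style cues. The function names and the grouping of three words per cue are assumptions made for illustration, not part of WhisperX.

```python
def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words, max_words=3):
    """Group aligned words into numbered cues of at most max_words words each."""
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        text = " ".join(w["word"] for w in chunk)
        # Each cue spans from its first word's start to its last word's end.
        span = f"{srt_time(chunk[0]['start'])} --> {srt_time(chunk[-1]['end'])}"
        cues.append(f"{len(cues) + 1}\n{span}\n{text}\n")
    return "\n".join(cues)


# Words from the first segment of the example output.
sample_words = [
    {"word": "It's", "start": 0.028, "end": 0.128, "score": 0.218},
    {"word": "the", "start": 0.148, "end": 0.248, "score": 0.683},
    {"word": "imagination.", "start": 0.268, "end": 0.929, "score": 0.873},
]
srt_text = words_to_srt(sample_words)
print(srt_text)
```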

This implementation of WhisperX uses the more lightweight Whisper medium model, which mainly supports English.

For more information about WhisperX, including implementation details, see the WhisperX GitHub repo.

Citation

If you use this in your research, please cite the paper:

@article{bain2022whisperx,
  title={WhisperX: Time-Accurate Speech Transcription of Long-Form Audio},
  author={Bain, Max and Huh, Jaesung and Han, Tengda and Zisserman, Andrew},
  journal={INTERSPEECH 2023},
  year={2023}
}