victor-upmeet / whisperx

Accelerated transcription, word-level timestamps and diarization with whisperX large-v3

  • Public
  • 1.7M runs
  • A100 (80GB)
  • GitHub
  • Paper
  • License

Input

file (required)

Audio file

string

ISO code of the language spoken in the audio; specify None to perform language detection

number

If the language is not specified, it will be detected recursively on different parts of the file until the given probability is reached

Default: 0

integer

If the language is not specified, detection follows the logic of the language_detection_min_prob parameter, but stops after the given number of retries. If the maximum number of retries is reached, the most probable language is kept.

Default: 5

string

Optional text to provide as a prompt for the first window

integer

Batch size used to parallelize transcription of the input audio

Default: 64

number

Temperature to use for sampling

Default: 0

number

Voice activity detection (VAD) onset

Default: 0.5

number

Voice activity detection (VAD) offset

Default: 0.363

boolean

Aligns whisper output to get accurate word-level timestamps

Default: false

boolean

Assign speaker ID labels

Default: false

string

To enable diarization, please enter your HuggingFace access token (with read permission). You need to accept the user agreement for the models specified in the README.

integer

Minimum number of speakers if diarization is activated (leave blank if unknown)

integer

Maximum number of speakers if diarization is activated (leave blank if unknown)

boolean

Print out compute/inference times and memory usage information

Default: false
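
A prediction with the inputs above can be created through the Replicate Python client along the following lines. This is a minimal sketch: the input field names used here are assumptions inferred from the descriptions above (only language_detection_min_prob is named explicitly on this page), so check the model's API schema for the authoritative names.

import replicate

# Sketch only: input keys below are assumed from the parameter descriptions above,
# not taken from the model's published schema.
output = replicate.run(
    "victor-upmeet/whisperx",
    input={
        "audio_file": open("speech.wav", "rb"),  # required audio file
        "language": "en",                        # ISO code; omit to auto-detect
        "batch_size": 64,                        # parallel transcription batch size
        "temperature": 0,                        # sampling temperature
        "align_output": True,                    # word-level timestamps via alignment
        "diarization": False,                    # requires a HuggingFace token when True
        "debug": False,                          # print compute/memory usage info
    },
)

print(output["detected_language"])
for segment in output["segments"]:
    print(segment["start"], segment["end"], segment["text"])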

Output

segments

[ { "end": 30.811, "text": " The little tales they tell are false. The door was barred, locked and bolted as well. Ripe pears are fit for a queen's table. A big wet stain was on the round carpet. The kite dipped and swayed but stayed aloft. The pleasant hours fly by much too soon. The room was crowded with a mild wob.", "start": 2.585 }, { "end": 48.592, "text": " The room was crowded with a wild mob. This strong arm shall shield your honor. She blushed when he gave her a white orchid. The beetle droned in the hot June sun.", "start": 33.029 } ]

detected_language

en
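
The segments list pairs each chunk of text with start and end times in seconds, so it can be post-processed into whatever format you need. The sketch below assumes only the structure shown above and converts it into SRT-style cues.

def to_timestamp(seconds: float) -> str:
    # Convert seconds to an SRT timestamp (HH:MM:SS,mmm).
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments: list[dict]) -> str:
    # Number the cues and separate them with blank lines, as in a .srt file.
    cues = []
    for i, seg in enumerate(segments, start=1):
        cues.append(f"{i}\n{to_timestamp(seg['start'])} --> {to_timestamp(seg['end'])}\n{seg['text'].strip()}\n")
    return "\n".join(cues)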

This example was created by a different version, victor-upmeet/whisperx:77505c70.

Run time and cost

This model costs approximately $0.024 to run on Replicate, or 41 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.

This model runs on Nvidia A100 (80GB) GPU hardware. Predictions typically complete within 18 seconds. The predict time for this model varies significantly based on the inputs.

Readme

General

This model is intended for transcribing audio files that are no more than a few hours long and no larger than a couple of hundred MB. If you need to transcribe bigger audio files, please use victor-upmeet/whisperx-a40-large, which is the same model running on A40 (Large) hardware. It costs more, but the additional RAM allows very large files to be handled.

Model Information

WhisperX provides fast automatic speech recognition (70x realtime with large-v3) with word-level timestamps and speaker diarization.

Whisper is an ASR model developed by OpenAI, trained on a large dataset of diverse audio. While it produces highly accurate transcriptions, the corresponding timestamps are at the utterance level, not per word, and can be inaccurate by several seconds. OpenAI's Whisper does not natively support batching, but WhisperX does.

The model used for transcription is large-v3 from faster-whisper.

For more information about WhisperX, including implementation details, see the WhisperX GitHub repo.
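
For orientation, the pipeline this model wraps (batched transcription with faster-whisper, forced alignment for word-level timestamps, then optional diarization) looks roughly like the following sketch using the open-source whisperx package directly. Function names follow the WhisperX README and may differ between versions; the HuggingFace token is a placeholder.

import whisperx

device = "cuda"
audio = whisperx.load_audio("speech.wav")

# 1. Batched transcription with faster-whisper large-v3
model = whisperx.load_model("large-v3", device, compute_type="float16")
result = model.transcribe(audio, batch_size=16)

# 2. Forced alignment for accurate word-level timestamps
align_model, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

# 3. Optional speaker diarization (requires accepting the pyannote model agreements)
diarize_model = whisperx.DiarizationPipeline(use_auth_token="hf_xxx", device=device)
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)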

Citation

@misc{bain2023whisperx,
      title={WhisperX: Time-Accurate Speech Transcription of Long-Form Audio}, 
      author={Max Bain and Jaesung Huh and Tengda Han and Andrew Zisserman},
      year={2023},
      eprint={2303.00747},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}