daanelson / whisperx

Accelerated transcription of audio using WhisperX

  • Public
  • 62.3K runs
  • T4
  • GitHub
  • Paper
  • License

Input

*file

Audio file

integer

Parallelization of input audio transcription

Default: 32

boolean

Enable if you need word-level timing and not just batched transcription. Currently works only for English.

Default: false

boolean

Set if you only want to return text; otherwise, segment metadata will be returned as well.

Default: false

boolean

Print out memory usage information.

Default: false

Output

[
  {
    "end": 30.772,
    "text": " The little tales they tell are false. The door was barred, locked and bolted as well. Ripe pears are fit for a queen's table. A big wet stain was on the round carpet. The kite dipped and swayed but stayed aloft. The pleasant hours fly by much too soon. The room was crowded with a mild wob.",
    "start": 2.557
  },
  {
    "end": 48.558,
    "text": " The room was crowded with a wild mob. This strong arm shall shield your honour. She blushed when he gave her a white orchid. The beetle droned in the hot June sun.",
    "start": 32.999
  }
]
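The segment list above can be post-processed with a few lines of Python. The following is a minimal sketch, assuming the output arrives as a JSON string shaped like the example (a list of objects with "start", "end", and "text" keys). The helper name summarize_segments and the shortened sample text are illustrative and not part of the model's API.

```python
import json

# Sample shaped like the output above; the text is shortened for brevity.
raw_output = """
[
  {"end": 30.772, "text": " The little tales they tell are false.", "start": 2.557},
  {"end": 48.558, "text": " The room was crowded with a wild mob.", "start": 32.999}
]
"""


def summarize_segments(segments):
    """Return (full_text, total_speech_seconds) for a list of segment dicts.

    Hypothetical helper: joins the stripped segment texts with single spaces
    and sums each segment's (end - start) duration.
    """
    full_text = " ".join(seg["text"].strip() for seg in segments)
    total_seconds = sum(seg["end"] - seg["start"] for seg in segments)
    return full_text, round(total_seconds, 3)


segments = json.loads(raw_output)
full_text, speech_seconds = summarize_segments(segments)
print(full_text)
print(speech_seconds)
```

Stripping each segment's text before joining avoids the doubled spaces that the leading space in every "text" value would otherwise produce.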

Run time and cost

This model costs approximately $0.0042 to run on Replicate, or 238 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.
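The relationship between the per-run cost and the runs-per-dollar figure is simple arithmetic. The sketch below reproduces it, assuming the approximate $0.0042 figure quoted above holds for every run; actual cost varies with the inputs. The function names are illustrative.

```python
COST_PER_RUN_USD = 0.0042  # approximate per-run cost quoted above


def runs_per_dollar(cost_per_run=COST_PER_RUN_USD):
    """Whole number of runs that fit in one dollar at the given per-run cost."""
    return int(1 / cost_per_run)


def estimated_cost(num_runs, cost_per_run=COST_PER_RUN_USD):
    """Rough total cost in USD for num_runs runs, assuming a constant per-run cost."""
    return round(num_runs * cost_per_run, 4)


print(runs_per_dollar())     # whole runs per dollar
print(estimated_cost(1000))  # rough cost of 1,000 runs in USD
```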

This model runs on Nvidia T4 GPU hardware. Predictions typically complete within 19 seconds. The predict time for this model varies significantly based on the inputs.

Readme

Model Information

WhisperX provides fast automatic speech recognition (70x realtime with large-v2) with word-level timestamps and speaker diarization.
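To put the "70x realtime" figure in perspective, the sketch below converts a realtime factor into an estimated processing time. It assumes the factor applies uniformly to the whole file, which is a simplification: throughput depends on the GPU, the batch size, and the audio itself, so treat the result as a rough estimate. The function name is illustrative.

```python
def estimated_processing_seconds(audio_seconds, realtime_factor=70.0):
    """Estimate transcription time for audio of the given length.

    A realtime factor of 70 means 70 seconds of audio are processed
    per second of compute, under the assumptions stated above.
    """
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_seconds / realtime_factor


one_hour = 60 * 60
print(round(estimated_processing_seconds(one_hour), 1))  # seconds for one hour of audio
```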

Whisper is an ASR model developed by OpenAI, trained on a large dataset of diverse audio. While it produces highly accurate transcriptions, the corresponding timestamps are at the utterance level, not per word, and can be inaccurate by several seconds. OpenAI's Whisper does not natively support batching, but WhisperX does.

This implementation of WhisperX supports transcription in all languages supported by Whisper, and alignment of English audio. Although WhisperX itself supports alignment for multiple languages, this implementation currently supports only English alignment, to keep transcription fast.

For more information about WhisperX, including implementation details, see the WhisperX GitHub repo.

Citation

@misc{bain2023whisperx,
      title={WhisperX: Time-Accurate Speech Transcription of Long-Form Audio}, 
      author={Max Bain and Jaesung Huh and Tengda Han and Andrew Zisserman},
      year={2023},
      eprint={2303.00747},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}