carnifexer / whisperx

ASR with word alignment, based on WhisperX and using the Whisper medium model (769M parameters)

  • Public
  • 13.4K runs
  • T4
  • GitHub
  • Paper
  • License

Input

  • file (required): Audio file
  • integer: Parallelization of input audio transcription. Default: 32
  • boolean: Use if you need word-level timing and not just batched transcription. Default: false
  • boolean: Set if you only want to return text; otherwise, segment metadata will be returned as well. Default: false
  • boolean: Print out memory usage information. Default: false
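
As a rough illustration of how these inputs and their defaults fit together, the sketch below builds a request payload in Python. The field names (`audio`, `batch_size`, `align_output`, `only_text`, `debug`) are placeholders chosen for this example, since the listing above shows only the types, descriptions, and defaults; check the model's API schema for the real names before using them.

```python
from dataclasses import dataclass, asdict


@dataclass
class WhisperXInput:
    # All field names here are hypothetical; only the types, descriptions,
    # and defaults come from the input listing.
    audio: str                  # required audio file (path or URL)
    batch_size: int = 32        # parallelization of input audio transcription
    align_output: bool = False  # word-level timing instead of batched transcription only
    only_text: bool = False     # return text only, without segment metadata
    debug: bool = False         # print memory usage information

    def to_payload(self) -> dict:
        """Validate the values and return a plain dict suitable for a request body."""
        if self.batch_size < 1:
            raise ValueError("batch_size must be a positive integer")
        return asdict(self)


# Request word-level timestamps and keep every other default.
payload = WhisperXInput(audio="speech.wav", align_output=True).to_payload()
print(payload)
```

Keeping the defaults in one place makes it easier to see which options a given request actually overrides.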

Output

[{"start": 0.028, "end": 1.289, "text": " It's the imagination.", "words": [{"word": "It's", "start": 0.028, "end": 0.128, "score": 0.218}, {"word": "the", "start": 0.148, "end": 0.248, "score": 0.683}, {"word": "imagination.", "start": 0.268, "end": 0.929, "score": 0.873}]}, {"start": 1.289, "end": 10.116, "text": "If you examine any large-scale human cooperation, you always find fiction as its basis.", "words": [{"word": "If", "start": 1.289, "end": 1.369, "score": 0.84}, {"word": "you", "start": 1.389, "end": 1.529, "score": 0.83}, {"word": "examine", "start": 1.589, "end": 2.07, "score": 0.806}, {"word": "any", "start": 2.35, "end": 2.57, "score": 0.77}, {"word": "large-scale", "start": 2.87, "end": 3.791, "score": 0.864}, {"word": "human", "start": 3.871, "end": 4.151, "score": 0.94}, {"word": "cooperation,", "start": 4.191, "end": 4.872, "score": 0.785}, {"word": "you", "start": 5.212, "end": 5.392, "score": 0.754}, {"word": "always", "start": 5.553, "end": 5.953, "score": 0.678}, {"word": "find", "start": 6.193, "end": 6.573, "score": 0.844}, {"word": "fiction", "start": 6.794, "end": 7.294, "score": 0.837}, {"word": "as", "start": 8.255, "end": 8.335, "score": 0.94}, {"word": "its", "start": 8.395, "end": 8.515, "score": 0.724}, {"word": "basis.", "start": 8.615, "end": 9.055, "score": 0.916}]}, {"start": 10.116, "end": 14.72, "text": "It's a fictional story that holds lots of strangers together.", "words": [{"word": "It's", "start": 10.116, "end": 10.216, "score": 0.88}, {"word": "a", "start": 10.256, "end": 10.276, "score": 0.979}, {"word": "fictional", "start": 10.356, "end": 10.877, "score": 0.92}, {"word": "story", "start": 10.957, "end": 11.417, "score": 0.82}, {"word": "that", "start": 11.577, "end": 11.718, "score": 0.916}, {"word": "holds", "start": 11.818, "end": 12.178, "score": 0.804}, {"word": "lots", "start": 12.558, "end": 12.758, "score": 0.876}, {"word": "of", "start": 12.798, "end": 12.858, "score": 0.769}, {"word": "strangers", "start": 13.079, "end": 13.779, "score": 0.845}, {"word": "together.", "start": 14.28, "end": 14.72, "score": 0.881}]}]
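
The output is a JSON list of segments, each with `start`, `end`, `text`, and a `words` list carrying per-word timings and a `score`. A minimal sketch of consuming it in Python follows. It assumes `score` is an alignment confidence where higher is better (the page does not define it), and the 0.5 threshold and both helper names are arbitrary choices for illustration.

```python
import json

# First segment of the sample output above, copied verbatim.
sample = '''[{"start": 0.028, "end": 1.289, "text": " It's the imagination.", "words": [{"word": "It's", "start": 0.028, "end": 0.128, "score": 0.218}, {"word": "the", "start": 0.148, "end": 0.248, "score": 0.683}, {"word": "imagination.", "start": 0.268, "end": 0.929, "score": 0.873}]}]'''


def low_confidence_words(segments, threshold=0.5):
    """Return word entries whose score falls below the (arbitrary) threshold."""
    return [
        word
        for segment in segments
        for word in segment.get("words", [])  # "words" may be absent without alignment
        if word.get("score", 1.0) < threshold
    ]


def to_timestamp(seconds):
    """Format seconds as HH:MM:SS,mmm (the style used by SRT subtitles)."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


segments = json.loads(sample)
flagged = low_confidence_words(segments)
for word in flagged:
    print(to_timestamp(word["start"]), word["word"], word["score"])
```

Flagging low-scoring words this way can help decide which timings to review by hand before using them for subtitles.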

Run time and cost

This model costs approximately $0.0019 to run on Replicate, or 526 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.
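
The two figures quoted above are the same number seen from both sides, which a few lines of Python can confirm. The helper `estimated_cost` is a hypothetical name used only here, and the result is an approximation because the actual cost varies with the inputs.

```python
COST_PER_RUN_USD = 0.0019  # approximate per-run cost quoted above

# Whole runs that fit in one dollar: 1 / 0.0019 is about 526.3.
runs_per_dollar = int(1 / COST_PER_RUN_USD)


def estimated_cost(runs, cost_per_run=COST_PER_RUN_USD):
    """Rough cost in USD for a number of runs at the quoted average price."""
    return runs * cost_per_run


print(runs_per_dollar)
print(round(estimated_cost(1000), 2))
```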

This model runs on Nvidia T4 GPU hardware. Predictions typically complete within 9 seconds. The predict time for this model varies significantly based on the inputs.

Readme

Whisper is an ASR model developed by OpenAI, trained on a large dataset of diverse audio. While it produces highly accurate transcriptions, the corresponding timestamps are at the utterance level, not per word, and can be inaccurate by several seconds. OpenAI's Whisper also does not natively support batching, but WhisperX does.

This implementation of WhisperX uses the more lightweight Whisper medium model, which mainly supports English.

For more information about WhisperX, including implementation details, see the WhisperX GitHub repo.

Citation

If you use this in your research, please cite the paper:

@article{bain2022whisperx,
  title={WhisperX: Time-Accurate Speech Transcription of Long-Form Audio},
  author={Bain, Max and Huh, Jaesung and Han, Tengda and Zisserman, Andrew},
  journal={INTERSPEECH 2023},
  year={2023}
}