keeandev/whisperx | Readme and Docs

Model Information

WhisperX provides fast automatic speech recognition (70x realtime with large-v2) with word-level timestamps and speaker diarization.
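As a rough illustration of what a 70x realtime factor means in practice, the sketch below estimates processing time from audio duration. The function name is hypothetical, and the linear scaling is an assumption: actual throughput depends on hardware, batch size, and model size.

```python
def estimated_processing_seconds(audio_seconds: float,
                                 realtime_factor: float = 70.0) -> float:
    """Estimate wall-clock processing time for a clip.

    Assumes throughput scales linearly with audio duration, which is a
    simplification; the default of 70 comes from the figure quoted above.
    """
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_seconds / realtime_factor


one_hour = 60 * 60  # seconds of audio
print(f"{estimated_processing_seconds(one_hour):.1f} s")  # prints "51.4 s"
```

Under these assumptions, an hour of audio would take a little under a minute to transcribe.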

Whisper is an ASR model developed by OpenAI, trained on a large dataset of diverse audio. While it produces highly accurate transcriptions, the corresponding timestamps are at the utterance level, not per word, and can be inaccurate by several seconds. OpenAI’s Whisper also does not natively support batching, but WhisperX does.
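The difference between utterance-level and word-level timestamps can be sketched with plain Python data. The field names below are modeled on the segment dictionaries WhisperX returns, but the exact layout, the helper function, and all timing values are assumptions made up for illustration.

```python
# Utterance-level: one coarse time span for the whole segment.
# (Hypothetical values, for illustration only.)
utterance_level = {
    "start": 0.0,
    "end": 4.2,
    "text": "Hello world, this is WhisperX.",
}

# Word-level: the same segment with a time span for each word.
word_level = {
    "start": 0.31,
    "end": 2.05,
    "text": "Hello world, this is WhisperX.",
    "words": [
        {"word": "Hello", "start": 0.31, "end": 0.62},
        {"word": "world,", "start": 0.66, "end": 1.04},
        {"word": "this", "start": 1.20, "end": 1.35},
        {"word": "is", "start": 1.38, "end": 1.47},
        {"word": "WhisperX.", "start": 1.50, "end": 2.05},
    ],
}


def words_between(segment: dict, t0: float, t1: float) -> list[str]:
    """Return the words that fall entirely inside the window [t0, t1]."""
    return [
        w["word"]
        for w in segment.get("words", [])
        if w["start"] >= t0 and w["end"] <= t1
    ]


print(words_between(word_level, 0.0, 1.0))  # prints ['Hello']
```

With only utterance-level timing, a query like this cannot be answered: the helper returns an empty list for `utterance_level` because there are no per-word spans to filter.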

This implementation of WhisperX supports transcription in all languages supported by Whisper, and alignment of English audio. Although WhisperX itself supports alignment for multiple languages, this implementation currently aligns English only, to keep transcription fast. It has also been upgraded to whisper-large-v3, thanks to the new dataset release by OpenAI, with diarization upgraded to speaker-diarization-3.1 and segmentation-3.0, powered by pyannote.
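The flow described above (batched transcription in any language, then alignment for English only) might look roughly like the following sketch. It assumes the upstream whisperx Python package and the calls shown in its README; the function name, the ALIGNMENT_LANGUAGES constant, and the default arguments are hypothetical, and this hosted implementation may wire things differently. Diarization is left out.

```python
# Assumption: mirrors the "English-only alignment" policy described above.
ALIGNMENT_LANGUAGES = {"en"}


def transcribe_with_optional_alignment(audio_path: str,
                                       device: str = "cuda",
                                       batch_size: int = 16) -> dict:
    """Transcribe a file and add word-level timestamps for English audio."""
    # Imported lazily so this module can be loaded without whisperx installed.
    import whisperx

    # Batched transcription with the large-v3 checkpoint.
    model = whisperx.load_model("large-v3", device, compute_type="float16")
    audio = whisperx.load_audio(audio_path)
    result = model.transcribe(audio, batch_size=batch_size)

    language = result["language"]
    if language not in ALIGNMENT_LANGUAGES:
        # Other languages keep their segment-level timestamps only.
        return result

    # Forced alignment refines the timestamps down to the word level.
    align_model, metadata = whisperx.load_align_model(
        language_code=language, device=device
    )
    aligned = whisperx.align(
        result["segments"], align_model, metadata, audio, device,
        return_char_alignments=False,
    )
    aligned["language"] = language
    return aligned
```

Skipping alignment for non-English audio avoids loading an extra alignment model per language, which is one plausible reading of the speed trade-off mentioned above.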

For more information about WhisperX, including implementation details, see the WhisperX GitHub repo.

Citation

@misc{bain2023whisperx,
      title={WhisperX: Time-Accurate Speech Transcription of Long-Form Audio}, 
      author={Max Bain and Jaesung Huh and Tengda Han and Andrew Zisserman},
      year={2023},
      eprint={2303.00747},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}