General
This model is a clone of victor-upmeet/whisperx.
The purpose of this model is to transcribe large audio files (a few hundred MB, or a few hours long), which victor-upmeet/whisperx cannot handle because of its limited available RAM. Please try victor-upmeet/whisperx first. If it fails with an unknown error and your audio file is large, use this model instead.
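The "try the standard model first, then fall back" advice above can be sketched as a small helper. This is only a minimal sketch: `run_standard` and `run_large` are hypothetical callables that you would supply yourself (for example, thin wrappers around whatever client you use to call each model), and nothing here is part of either model's API.

```python
from typing import Callable


def transcribe_with_fallback(
    audio_path: str,
    run_standard: Callable[[str], dict],  # hypothetical: calls victor-upmeet/whisperx
    run_large: Callable[[str], dict],     # hypothetical: calls this large-file clone
) -> dict:
    """Try the standard model first; if it raises, retry on the large-file model."""
    try:
        return run_standard(audio_path)
    except Exception as err:
        # The standard model may fail on large files for reasons it does not report
        # clearly, so any exception triggers the fallback in this sketch.
        print(f"standard model failed ({err!r}); retrying with the large-file model")
        return run_large(audio_path)
```

Catching every exception is a deliberate simplification. In practice you may want to fall back only when the file is actually large, so that unrelated errors (a bad path, an unsupported format) are not retried needlessly.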
Model Information
WhisperX provides fast automatic speech recognition (70x realtime with large-v3) with word-level timestamps and speaker diarization.
Whisper is an ASR model developed by OpenAI, trained on a large dataset of diverse audio. Whilst it produces highly accurate transcriptions, the corresponding timestamps are at the utterance level, not per word, and can be inaccurate by several seconds. OpenAI’s Whisper does not natively support batching, but WhisperX does.
The model used for transcription is large-v3 from faster-whisper.
For more information about WhisperX, including implementation details, see the WhisperX GitHub repo.
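To illustrate why per-word timestamps matter, here is a minimal sketch of a segment that carries both an utterance-level span and word-level spans. The field names (`start`, `end`, `text`, `words`, `word`) are an assumption about the output layout, and the timings and text are made up for illustration. Check the actual output of the model before relying on them.

```python
# Illustrative data only: the text and timings below are invented for this sketch.
segment = {
    "start": 0.0,   # utterance-level start (seconds)
    "end": 2.4,     # utterance-level end (seconds)
    "text": "hello there world",
    "words": [      # word-level timestamps (assumed layout)
        {"word": "hello", "start": 0.1, "end": 0.5},
        {"word": "there", "start": 0.7, "end": 1.1},
        {"word": "world", "start": 1.6, "end": 2.2},
    ],
}


def words_between(segments, t0, t1):
    """Return the words whose time span lies fully inside [t0, t1]."""
    return [
        w["word"]
        for seg in segments
        for w in seg.get("words", [])
        if w["start"] >= t0 and w["end"] <= t1
    ]


print(words_between([segment], 0.6, 2.3))  # ['there', 'world']
```

With only the utterance-level span (0.0 to 2.4 s), a query like this could return the whole sentence or nothing. Word-level spans make it possible to cut subtitles or clips at word boundaries.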
Citation
@misc{bain2023whisperx,
title={WhisperX: Time-Accurate Speech Transcription of Long-Form Audio},
author={Max Bain and Jaesung Huh and Tengda Han and Andrew Zisserman},
year={2023},
eprint={2303.00747},
archivePrefix={arXiv},
primaryClass={cs.SD}
}