sidedwards / whisperx

WhisperX with accelerated transcription and advanced speaker diarization provides fast and accurate transcriptions with speaker segments.

  • Public
  • 190 runs
  • L40S
  • GitHub

Input

file

Audio file (mp3 or mp4)

file

Audio file as a Blob (mp3 or mp4)

string

ISO code of the language spoken in the audio. Specify None to perform language detection.

number

If no language is specified, the language is detected recursively on different parts of the file until a detection reaches the given probability.

Default: 0

integer

If no language is specified, detection follows the logic of the language_detection_min_prob parameter but stops after the given maximum number of tries. If that maximum is reached, the most probable language is kept.

Default: 5
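
The interaction between the minimum probability and the maximum number of tries can be sketched as follows. This is a minimal illustration, not the model's actual code: the function name, the `segments` argument, and the `detect` callback are hypothetical, and the sketch assumes each try looks at a different part of the file and returns a (language, probability) pair.

```python
from typing import Callable, Sequence, Tuple


def detect_language_with_retries(
    segments: Sequence[object],
    detect: Callable[[object], Tuple[str, float]],
    min_prob: float = 0.0,
    max_tries: int = 5,
) -> Tuple[str, float]:
    """Try successive parts of the audio until one detection is confident enough.

    `detect` is a hypothetical callback returning (language, probability).
    """
    if max_tries < 1 or not segments:
        raise ValueError("need at least one segment and one try")

    best_lang, best_prob = "", -1.0
    for segment in list(segments)[:max_tries]:
        lang, prob = detect(segment)
        # Remember the most probable language seen so far.
        if prob > best_prob:
            best_lang, best_prob = lang, prob
        # Stop early once the threshold is met.
        if prob >= min_prob:
            break
    # Either the threshold was met, or the tries ran out and the
    # most probable language is kept.
    return best_lang, best_prob
```

With the default minimum probability of 0, the first detection always satisfies the threshold, so only one part of the file is examined.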

string

Optional text to provide as a prompt for the first window

integer

Parallelization of input audio transcription

Default: 64

number

Temperature to use for sampling

Default: 0

number

VAD onset

Default: 0.5

number

VAD offset

Default: 0.363
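
The page does not say how the two VAD values are applied. A common reading, assumed here, is hysteresis thresholding on a per-frame speech probability: a region opens when the probability rises above the onset and closes when it falls below the offset. The function name, the `probs` input, and the frame duration below are illustrative only.

```python
from typing import List, Optional, Sequence, Tuple


def speech_regions(
    probs: Sequence[float],
    frame_s: float,
    onset: float = 0.5,
    offset: float = 0.363,
) -> List[Tuple[float, float]]:
    """Turn per-frame speech probabilities into (start, end) times in seconds.

    Assumption: onset/offset act as hysteresis thresholds.
    """
    regions: List[Tuple[float, float]] = []
    start: Optional[float] = None
    for i, p in enumerate(probs):
        if start is None and p > onset:
            # Probability rose above the onset: speech begins.
            start = i * frame_s
        elif start is not None and p < offset:
            # Probability fell below the offset: speech ends.
            regions.append((start, i * frame_s))
            start = None
    if start is not None:
        # Audio ended while still inside a speech region.
        regions.append((start, len(probs) * frame_s))
    return regions
```

Because the offset is lower than the onset, a region that has started survives brief dips in probability, which avoids splitting one utterance into many short fragments.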

boolean

Align Whisper output to get accurate word-level timestamps

Default: false

boolean

Assign speaker ID labels

Default: false

string

To enable diarization, enter a Hugging Face token with read access. You must first accept the user agreement for the models specified in the README.

integer

Minimum number of speakers if diarization is activated (leave blank if unknown)

integer

Maximum number of speakers if diarization is activated (leave blank if unknown)

boolean

Print out compute/inference times and memory usage information

Default: false
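
The constraints described above (a token is required for diarization, speaker bounds may be left blank) can be checked before submitting a prediction. The sketch below only builds an input dictionary and makes no network call. The field names are not shown on this page, apart from language_detection_min_prob, so every key used here (`audio_file`, `language`, `align_output`, `diarization`, `huggingface_access_token`, `min_speakers`, `max_speakers`) is an assumption to verify against the model's schema, and `build_input` is a hypothetical helper.

```python
from typing import Any, Dict, Optional


def build_input(
    audio_url: str,
    language: Optional[str] = None,
    align_output: bool = False,
    diarization: bool = False,
    huggingface_access_token: Optional[str] = None,
    min_speakers: Optional[int] = None,
    max_speakers: Optional[int] = None,
) -> Dict[str, Any]:
    """Assemble a prediction input; all key names are assumed, not confirmed."""
    if diarization and not huggingface_access_token:
        raise ValueError("diarization requires a Hugging Face read token")
    if (
        min_speakers is not None
        and max_speakers is not None
        and min_speakers > max_speakers
    ):
        raise ValueError("min_speakers must not exceed max_speakers")

    payload: Dict[str, Any] = {
        "audio_file": audio_url,
        "align_output": align_output,
        "diarization": diarization,
    }
    optional = {
        "language": language,
        "huggingface_access_token": huggingface_access_token,
        "min_speakers": min_speakers,
        "max_speakers": max_speakers,
    }
    # Fields left blank are omitted so the model's own defaults apply.
    payload.update({k: v for k, v in optional.items() if v is not None})
    return payload
```

Omitting `language` corresponds to leaving the field empty, which triggers language detection.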

Run time and cost

This model runs on Nvidia L40S GPU hardware. We don't yet have enough runs of this model to provide performance information.

Readme

Based on: https://github.com/victor-upmeet/whisperx-replicate

Citation

@misc{bain2023whisperx,
      title={WhisperX: Time-Accurate Speech Transcription of Long-Form Audio}, 
      author={Max Bain and Jaesung Huh and Tengda Han and Andrew Zisserman},
      year={2023},
      eprint={2303.00747},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}

For more information, visit the WhisperX GitHub repository.