turian / insanely-fast-whisper-with-video

whisper-large-v3, incredibly fast, with video transcription

Run time and cost

This model runs on Nvidia A40 GPU hardware. Predictions typically complete within 3 minutes. The predict time for this model varies significantly based on the inputs.

Readme

Insanely Fast Whisper, with video transcription

TL;DR - Transcribe 150 minutes (2.5 hours) of audio in less than 98 seconds - with OpenAI’s Whisper Large v3. Blazingly fast transcription is now a reality!⚡️
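As a rough illustration of the setup behind that headline number (fp16, batch size 24, Flash Attention 2 on the Hugging Face Transformers ASR pipeline), here is a minimal sketch. The `FastWhisperConfig` class and `transcribe` helper are hypothetical names introduced for this example, and the `openai/whisper-large-v3` model id, the 30-second chunk length, and the `cuda:0` device are assumptions rather than settings confirmed by this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FastWhisperConfig:
    """Hypothetical settings holder; not part of this model's code."""

    model_id: str = "openai/whisper-large-v3"  # assumed Hugging Face Hub id
    batch_size: int = 24                       # "batching [24]" in the benchmarks
    chunk_length_s: int = 30                   # assumed chunk size for long audio
    use_flash_attention_2: bool = True         # requires the flash-attn package

    def load_kwargs(self) -> dict:
        """Keyword arguments passed when the pipeline is created."""
        attn = "flash_attention_2" if self.use_flash_attention_2 else "sdpa"
        return {
            "model": self.model_id,
            "model_kwargs": {"attn_implementation": attn},
        }

    def call_kwargs(self) -> dict:
        """Keyword arguments passed on each transcription call."""
        return {
            "chunk_length_s": self.chunk_length_s,
            "batch_size": self.batch_size,
            "return_timestamps": True,
        }


def transcribe(audio_path: str, config: FastWhisperConfig = FastWhisperConfig()) -> dict:
    """Transcribe one audio file on a CUDA GPU in half precision."""
    # Imported lazily so the config above can be inspected without a GPU stack.
    import torch
    from transformers import pipeline

    pipe = pipeline(
        "automatic-speech-recognition",
        torch_dtype=torch.float16,  # fp16, as in the fastest benchmark rows
        device="cuda:0",            # assumed single-GPU machine
        **config.load_kwargs(),
    )
    # The result is a dict with the full "text" and timestamped "chunks".
    return pipe(audio_path, **config.call_kwargs())
```

Calling `transcribe("talk.mp3")` (a placeholder file name) would download the model on first use. Passing `FastWhisperConfig(use_flash_attention_2=False)` falls back to PyTorch's built-in scaled-dot-product attention for GPUs where Flash Attention 2 is not installed.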

Not convinced? Here are some benchmarks we ran on an Nvidia A100 (80 GB) 👇

| Optimisation type | Time to transcribe 150 min of audio (minutes) |
|---|---|
| large-v3 (Transformers) (fp32) | ~31 (31 min 1 sec) |
| large-v3 (Transformers) (fp16 + batching [24] + bettertransformer) | ~5 (5 min 2 sec) |
| large-v3 (Transformers) (fp16 + batching [24] + Flash Attention 2) | ~2 (1 min 38 sec) |
| distil-large-v2 (Transformers) (fp16 + batching [24] + bettertransformer) | ~3 (3 min 16 sec) |
| distil-large-v2 (Transformers) (fp16 + batching [24] + Flash Attention 2) | ~1 (1 min 18 sec) |
| large-v2 (Faster Whisper) (fp16 + beam_size [1]) | ~9 (9 min 23 sec) |
| large-v2 (Faster Whisper) (8-bit + beam_size [1]) | ~8 (8 min 15 sec) |
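To make the benchmark figures easier to compare, the short script below converts each timing to seconds and derives two ratios: how many times faster than real time each setup runs, and its speedup over the fp32 baseline. The timings are copied from the benchmarks above; the shortened row labels and the helper names (`to_seconds`, `speed_vs_real_time`, `speedup_over_baseline`) are illustrative choices, not part of the project.

```python
AUDIO_SECONDS = 150 * 60  # 150 minutes of audio

# (minutes, seconds) of wall-clock time per benchmark row.
BENCHMARKS = {
    "large-v3 fp32": (31, 1),
    "large-v3 fp16 + batch 24 + bettertransformer": (5, 2),
    "large-v3 fp16 + batch 24 + Flash Attention 2": (1, 38),
    "distil-large-v2 fp16 + batch 24 + bettertransformer": (3, 16),
    "distil-large-v2 fp16 + batch 24 + Flash Attention 2": (1, 18),
    "Faster Whisper large-v2 fp16, beam 1": (9, 23),
    "Faster Whisper large-v2 8-bit, beam 1": (8, 15),
}


def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a (minutes, seconds) timing to total seconds."""
    return minutes * 60 + seconds


def speed_vs_real_time(elapsed_seconds: float) -> float:
    """Seconds of audio processed per second of wall-clock time."""
    return AUDIO_SECONDS / elapsed_seconds


def speedup_over_baseline(name: str, baseline: str = "large-v3 fp32") -> float:
    """How many times faster `name` is than the baseline row."""
    return to_seconds(*BENCHMARKS[baseline]) / to_seconds(*BENCHMARKS[name])


if __name__ == "__main__":
    for name, timing in BENCHMARKS.items():
        elapsed = to_seconds(*timing)
        print(
            f"{name:55s} {elapsed:5d} s  "
            f"{speed_vs_real_time(elapsed):6.1f}x real time  "
            f"{speedup_over_baseline(name):5.1f}x vs fp32"
        )
```

These ratios only restate the timings listed above; they say nothing about transcription accuracy, and the rows use different model versions and libraries.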