vaibhavs10 / incredibly-fast-whisper

whisper-large-v3, incredibly fast, powered by Hugging Face Transformers! 🤗

  • Public
  • 4M runs
  • L40S
  • GitHub
  • License

Input

*file

Audio file

string

Task to perform: transcribe or translate to another language.

Default: "transcribe"

string

Language spoken in the audio; specify 'None' to perform language detection.

Default: "None"

integer

Number of parallel batches to compute. Reduce this if you run into out-of-memory (OOM) errors.

Default: 24

string

Whisper supports both chunk-level and word-level timestamps.

Default: "chunk"

boolean

Use Pyannote.audio to diarise the audio clips. You will also need to provide an hf_token below.

Default: false

string

Provide a Hugging Face token (hf.co/settings/token) so Pyannote.audio can diarise the audio clips. You must first agree to the terms at https://huggingface.co/pyannote/speaker-diarization-3.1 and https://huggingface.co/pyannote/segmentation-3.0.
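For quick programmatic use, here is a minimal sketch of calling the model through the Replicate Python client. The input field names below (file, task, language, batch_size, timestamp, diarise_audio, hf_token) are assumptions inferred from the descriptions above; check the model's API schema for the authoritative names.

```python
# Minimal sketch, assuming the input names below match the model's API schema.
import replicate

output = replicate.run(
    "vaibhavs10/incredibly-fast-whisper",
    input={
        "file": open("audio.mp3", "rb"),  # required audio file (*file above); name assumed
        "task": "transcribe",             # or "translate"
        "language": "None",               # "None" performs language detection
        "batch_size": 24,                 # reduce if you run into OOM errors
        "timestamp": "chunk",             # or "word" for word-level timestamps
        "diarise_audio": False,           # set to True and supply "hf_token" to diarise
    },
)
print(output["text"])
```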

Output

{ "text": " the little tales they tell are false the door was barred locked and bolted as well ripe pears are fit hours fly by much too soon. The room was crowded with a mild wab. The room was crowded with a wild mob. This strong arm shall shield your honour. She blushed when he gave her a white orchid The beetle droned in the hot June sun", "chunks": [ { "text": " the little tales they tell are false the door was barred locked and bolted as well ripe pears are fit hours fly by much too soon. The room was crowded", "timestamp": [ 0, 29.72 ] }, { "text": " with a mild wab. The room was crowded with a wild mob. This strong arm shall shield your", "timestamp": [ 29.72, 38.98 ] }, { "text": " honour. She blushed when he gave her a white orchid The beetle droned in the hot June sun", "timestamp": [ 38.98, 48.52 ] } ] }

This example was created by a different version, vaibhavs10/incredibly-fast-whisper:37dfc0d6.
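Each entry in "chunks" pairs a piece of text with a [start, end] timestamp in seconds, so post-processing is straightforward. A small, hypothetical helper for turning the example output above into timestamped lines:

```python
# Sketch: format the "chunks" from the output above as timestamped lines.
def format_chunks(output: dict) -> str:
    lines = []
    for chunk in output.get("chunks", []):
        start, end = chunk["timestamp"]
        lines.append(f"[{start:7.2f}s -> {end:7.2f}s] {chunk['text'].strip()}")
    return "\n".join(lines)

# format_chunks(output) would yield, for the first chunk:
# [   0.00s ->   29.72s] the little tales they tell are false ...
```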

Run time and cost

This model costs approximately $0.0040 to run on Replicate, or 250 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.

This model runs on Nvidia L40S GPU hardware. Predictions typically complete within 5 seconds.

Readme

Incredibly Fast Whisper

Powered by 🤗 Transformers, Optimum & flash-attn

TL;DR - Transcribe 150 minutes of audio in 100 seconds - with OpenAI’s Whisper Large v3. Blazingly fast transcription is now a reality!⚡️

| Optimisation type | Time to transcribe (150 mins of audio) |
| --- | --- |
| Transformers (fp32) | ~31 min (31 min 1 sec) |
| Transformers (fp16 + batching [24] + bettertransformer) | ~5 min (5 min 2 sec) |
| Transformers (fp16 + batching [24] + Flash Attention 2) | ~2 min (1 min 38 sec) |
| distil-whisper (fp16 + batching [24] + bettertransformer) | ~3 min (3 min 16 sec) |
| distil-whisper (fp16 + batching [24] + Flash Attention 2) | ~1 min (1 min 18 sec) |
| Faster Whisper (fp16 + beam_size [1]) | ~9 min (9 min 23 sec) |
| Faster Whisper (8-bit + beam_size [1]) | ~8 min (8 min 15 sec) |
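The fp16 + batching [24] + Flash Attention 2 row corresponds to roughly the following Transformers pipeline configuration. This is a sketch, assuming a CUDA GPU and the flash-attn package are available; the pipeline arguments are standard Transformers API, and openai/whisper-large-v3 is the checkpoint named above.

```python
# Sketch of the fp16 + batching [24] + Flash Attention 2 setup from the table.
# Assumes a CUDA GPU and that the flash-attn package is installed.
import torch
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    torch_dtype=torch.float16,
    device="cuda:0",
    model_kwargs={"attn_implementation": "flash_attention_2"},
)

outputs = pipe(
    "audio.mp3",
    chunk_length_s=30,       # process the audio in 30-second chunks
    batch_size=24,           # the batching [24] from the table above
    return_timestamps=True,  # chunk-level timestamps
)
print(outputs["text"])
```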