turian / insanely-fast-whisper-with-video

whisper-large-v3, incredibly fast, with video transcription


Input

file

Audio file. Either this or url must be provided.

string

Video URL for yt-dlp to download the audio from. Either this or audio must be provided.

string

Task to perform: transcribe the audio in its spoken language, or translate it to English. (default: transcribe).

Default: "transcribe"

string

Optional. Language spoken in the audio; specify None to perform language detection.

integer

Number of parallel batches you want to compute. Reduce if you face OOMs. (default: 64).

Default: 64

string

Whisper supports both chunk-level and word-level timestamps. (default: chunk).

Default: "chunk"

boolean

Use Pyannote.audio to diarise the audio clips. You will need to provide hf_token below too.

Default: false

string

Provide a Hugging Face access token (create one at hf.co/settings/token) so Pyannote.audio can diarise the audio clips. You must first accept the terms at 'https://huggingface.co/pyannote/speaker-diarization-3.1' and 'https://huggingface.co/pyannote/segmentation-3.0'.
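The inputs above can be assembled into a single payload before a prediction is requested. The following is a minimal sketch, not the model's own code: only the `audio` and `url` names appear in the descriptions above, so the keys `task`, `language`, `batch_size`, `timestamp`, `diarise_audio` and `hf_token`, the `"word"` timestamp value, and the `build_input` helper itself are assumptions made for illustration.

```python
from typing import Any, Dict, Optional


def build_input(
    audio: Optional[str] = None,
    url: Optional[str] = None,
    task: str = "transcribe",
    language: Optional[str] = None,
    batch_size: int = 64,
    timestamp: str = "chunk",
    diarise_audio: bool = False,
    hf_token: Optional[str] = None,
) -> Dict[str, Any]:
    """Build an input payload. Key names other than audio/url are assumed."""
    # The page states that either an audio file or a video URL is required.
    if not audio and not url:
        raise ValueError("Either audio or url must be provided.")
    if task not in ("transcribe", "translate"):
        raise ValueError("task must be 'transcribe' or 'translate'.")
    # "word" is an assumed spelling of the word-level option.
    if timestamp not in ("chunk", "word"):
        raise ValueError("timestamp must be 'chunk' or 'word'.")
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer.")
    # Diarisation needs a Hugging Face token, per the description above.
    if diarise_audio and not hf_token:
        raise ValueError("hf_token is required when diarise_audio is true.")

    payload: Dict[str, Any] = {
        "task": task,
        "batch_size": batch_size,
        "timestamp": timestamp,
        "diarise_audio": diarise_audio,
    }
    if audio:
        payload["audio"] = audio
    if url:
        payload["url"] = url
    # Assumption: omitting the language leaves detection to the model.
    if language:
        payload["language"] = language
    if hf_token:
        payload["hf_token"] = hf_token
    return payload


if __name__ == "__main__":
    # Lower batch_size if the run hits out-of-memory errors.
    print(build_input(url="https://example.com/video", batch_size=32))
```

A payload built this way could then be passed as the `input` argument of the Replicate Python client's `replicate.run`, once the key names have been checked against the model's published schema.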

Output

{
  "text": " Do not f***ing touch it. Here is maybe the most overlooked feature or factor in the success or failure of a steak, particularly a thick steak, but it's true of all meat. This magical period immediately following its removal from the heat, it should rest on the board, meaning sit there at room temperature for five to seven minutes, at which point stay away from it. Don't touch it. Don't poke it. Don't slice it to look inside. Do not start slicing it into slices right away. What's going on inside is it is continuing to cook, but even more importantly, the juices are distributing themselves in a truly wonderful alignment. That's why if you cut into a steak too quickly off the barbecue, you get this sort of bullseye pattern instead of what it should be, a gentle graduation from red to various hues of pink to the outer crust. All the difference in the world between a good steak and a totally messed up steak is going on in that period of time that you're just doing nothing. Nothing. You want to find a good hot sizzling either grill or pan, get a good crust and sear on the outside. You want to finish it either in the oven or all the way on the grill. And then just let it sit. Don't wrap it in foil, don't cover it, don't poke it, don't prod it, don't even look at it. Just let it sit there, leave it alone, and you will be rewarded.",
  "chunks": [
    { "text": " Do not f***ing touch it.", "timestamp": [ 0, 2.62 ] },
    { "text": " Here is maybe the most overlooked feature or factor in the success or failure of a steak,", "timestamp": [ 9.46, 16.9 ] },
    { "text": " particularly a thick steak, but it's true of all meat.", "timestamp": [ 17, 19.36 ] },
    { "text": " This magical period immediately following its removal from the heat,", "timestamp": [ 19.78, 24 ] },
    { "text": " it should rest", "timestamp": [ 24, 25.4 ] },
    { "text": " on the board, meaning sit there at room temperature for five to seven minutes, at which point", "timestamp": [ 25.4, 31.32 ] },
    { "text": " stay away from it.", "timestamp": [ 31.32, 33.5 ] },
    { "text": " Don't touch it.", "timestamp": [ 33.5, 34.5 ] },
    { "text": " Don't poke it.", "timestamp": [ 34.5, 35.72 ] },
    { "text": " Don't slice it to look inside.", "timestamp": [ 35.72, 37.88 ] },
    { "text": " Do not start slicing it into slices right away.", "timestamp": [ 37.88, 41.18 ] },
    { "text": " What's going on inside is it is continuing to cook, but even more importantly, the juices are distributing themselves in a truly wonderful", "timestamp": [ 41.18, 51.1 ] },
    { "text": " alignment. That's why if you cut into a steak too quickly off the barbecue, you", "timestamp": [ 51.1, 55.26 ] },
    { "text": " get this sort of bullseye pattern instead of what it should be, a gentle", "timestamp": [ 55.26, 59.84 ] },
    { "text": " graduation from red to various hues of pink to the outer crust.", "timestamp": [ 59.84, 66.16 ] },
    { "text": " All the difference in the world between a good steak and a totally messed up steak is", "timestamp": [ 66.16, 70.96 ] },
    { "text": " going on in that period of time that you're just doing nothing.", "timestamp": [ 70.96, 75.96 ] },
    { "text": " Nothing.", "timestamp": [ 75.96, 76.96 ] },
    { "text": " You want to find a good hot sizzling either grill or pan, get a good crust and sear on", "timestamp": [ 76.96, 82.86 ] },
    { "text": " the outside.", "timestamp": [ 82.86, 83.86 ] },
    { "text": " You want to finish it either in the oven or all the way on the grill.", "timestamp": [ 83.86, 87.88 ] },
    { "text": " And then just let it sit.", "timestamp": [ 87.88, 90.2 ] },
    { "text": " Don't wrap it in foil, don't cover it, don't poke it, don't prod it, don't even look at it.", "timestamp": [ 90.2, 94.98 ] },
    { "text": " Just let it sit there, leave it alone, and you will be rewarded.", "timestamp": [ 94.98, 99.5 ] }
  ]
}
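Each entry in "chunks" pairs a text fragment with a [start, end] timestamp in seconds, which is enough to build subtitles. The sketch below converts such chunks to SubRip (SRT) text; the helper names `srt_time` and `chunks_to_srt` are hypothetical, and it assumes every chunk carries both a start and an end time, as in the example output.

```python
from typing import Dict, List


def srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def chunks_to_srt(chunks: List[Dict]) -> str:
    """Turn chunk dicts ({"text", "timestamp": [start, end]}) into SRT text."""
    blocks = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        # Chunk text starts with a space in the model output; strip it.
        text = chunk["text"].strip()
        blocks.append(f"{index}\n{srt_time(start)} --> {srt_time(end)}\n{text}")
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    # Two chunks copied from the example output.
    result = {
        "chunks": [
            {"text": " Don't touch it.", "timestamp": [33.5, 34.5]},
            {"text": " Don't poke it.", "timestamp": [34.5, 35.72]},
        ]
    }
    print(chunks_to_srt(result["chunks"]))
```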

Run time and cost

This model costs approximately $0.070 to run on Replicate, or 14 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.
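The "14 runs per $1" figure follows from dividing the budget by the per-run cost and rounding down. A small sketch of that arithmetic, using the approximate $0.070 figure quoted above (actual cost varies with inputs, and the function names are illustrative only):

```python
from decimal import Decimal

# Approximate per-run cost quoted on the page.
COST_PER_RUN = Decimal("0.070")


def runs_per_budget(budget: Decimal, cost_per_run: Decimal = COST_PER_RUN) -> int:
    """Whole runs that fit in a budget at a fixed per-run cost."""
    return int(budget // cost_per_run)


def estimated_cost(runs: int, cost_per_run: Decimal = COST_PER_RUN) -> Decimal:
    """Rough total for a number of runs at the quoted average cost."""
    return runs * cost_per_run


if __name__ == "__main__":
    print(runs_per_budget(Decimal("1")))  # 14
    print(estimated_cost(1000))
```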

This model runs on Nvidia L40S GPU hardware. Predictions typically complete within 73 seconds. The predict time for this model varies significantly based on the inputs.

Readme

Insanely Fast Whisper, with video transcription

TL;DR - Transcribe 150 minutes (2.5 hours) of audio in less than 98 seconds - with OpenAI’s Whisper Large v3. Blazingly fast transcription is now a reality!⚡️

Not convinced? Here are some benchmarks we ran on an Nvidia A100 - 80GB 👇

| Optimisation type | Time to Transcribe (150 mins of Audio) |
|---|---|
| large-v3 (Transformers) (fp32) | ~31 min (31 min 1 sec) |
| large-v3 (Transformers) (fp16 + batching [24] + bettertransformer) | ~5 min (5 min 2 sec) |
| large-v3 (Transformers) (fp16 + batching [24] + Flash Attention 2) | ~2 min (1 min 38 sec) |
| distil-large-v2 (Transformers) (fp16 + batching [24] + bettertransformer) | ~3 min (3 min 16 sec) |
| distil-large-v2 (Transformers) (fp16 + batching [24] + Flash Attention 2) | ~1 min (1 min 18 sec) |
| large-v2 (Faster Whisper) (fp16 + beam_size [1]) | ~9 min (9 min 23 sec) |
| large-v2 (Faster Whisper) (8-bit + beam_size [1]) | ~8 min (8 min 15 sec) |
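To compare the rows directly, the benchmark times above can be converted to seconds and expressed as a multiple of real time and as a speedup over the fp32 baseline. This is an illustrative calculation on the figures listed above, not an additional measurement; the dictionary labels are shortened for readability.

```python
# 150 minutes of audio, as in the benchmark above.
AUDIO_SECONDS = 150 * 60

# (minutes, seconds) taken from the benchmark figures above.
BENCHMARKS = {
    "large-v3 fp32": (31, 1),
    "large-v3 fp16 + batching + bettertransformer": (5, 2),
    "large-v3 fp16 + batching + Flash Attention 2": (1, 38),
    "distil-large-v2 fp16 + batching + bettertransformer": (3, 16),
    "distil-large-v2 fp16 + batching + Flash Attention 2": (1, 18),
    "large-v2 Faster Whisper fp16": (9, 23),
    "large-v2 Faster Whisper 8-bit": (8, 15),
}


def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a (minutes, seconds) pair to total seconds."""
    return minutes * 60 + seconds


def times_real_time(elapsed_seconds: int) -> float:
    """How many seconds of audio are processed per second of compute."""
    return AUDIO_SECONDS / elapsed_seconds


if __name__ == "__main__":
    baseline = to_seconds(*BENCHMARKS["large-v3 fp32"])
    for name, (minutes, seconds) in BENCHMARKS.items():
        elapsed = to_seconds(minutes, seconds)
        print(
            f"{name:<52} {elapsed:>5d} s  "
            f"{times_real_time(elapsed):6.1f}x real time  "
            f"{baseline / elapsed:5.1f}x vs fp32"
        )
```

On these figures, the Flash Attention 2 run of large-v3 (98 seconds) works out to roughly 92 times faster than real time and about 19 times faster than the fp32 baseline.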