Transcribe speech with openai/whisper

Run time and cost

Predictions run on Nvidia T4 GPU hardware and typically complete within 65 seconds, though predict time varies significantly with the input.


This is a Cog implementation of OpenAI's Whisper model. Code for the demo is here.


Model card

Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multi-task model that can perform multilingual speech recognition as well as speech translation and language identification.



A Transformer sequence-to-sequence model is trained on various speech processing tasks, including multilingual speech recognition, speech translation, spoken language identification, and voice activity detection. All of these tasks are jointly represented as a sequence of tokens to be predicted by the decoder, allowing for a single model to replace many different stages of a traditional speech processing pipeline. The multitask training format uses a set of special tokens that serve as task specifiers or classification targets.
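As a rough illustration of that multitask format, the sketch below builds the special-token prefix that selects a task for the decoder. The token names (`<|startoftranscript|>`, language tags like `<|en|>`, `<|transcribe|>`/`<|translate|>`, `<|notimestamps|>`) follow the open-source Whisper release, but this helper function itself is hypothetical, not part of the Whisper API.

```python
# Illustrative sketch (assumed, not Whisper's actual code): the decoder is
# steered by a prefix of special tokens instead of separate model heads.

def decoder_prompt(language: str, task: str, timestamps: bool = True) -> list:
    """Build the special-token prefix that tells the decoder what to do."""
    if task not in ("transcribe", "translate"):
        raise ValueError("task must be 'transcribe' or 'translate'")
    tokens = ["<|startoftranscript|>", "<|%s|>" % language, "<|%s|>" % task]
    if not timestamps:
        # Suppresses timestamp prediction in the output sequence.
        tokens.append("<|notimestamps|>")
    return tokens
```

For example, `decoder_prompt("fr", "translate", timestamps=False)` yields `['<|startoftranscript|>', '<|fr|>', '<|translate|>', '<|notimestamps|>']`, i.e. "translate French speech into English text, without timestamps".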


The code and the model weights of Whisper are released under the MIT License. See LICENSE for further details.