
Convert speech in audio to text
Run time and cost

Predictions run on Nvidia T4 GPU hardware and typically complete within 3 minutes, though predict time varies significantly with the input.

This is a Cog implementation of OpenAI's Whisper (https://github.com/openai/whisper).
Code for the demo is at https://github.com/chenxwh/cog-whisper.
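A run of this model boils down to sending an audio input to the prediction endpoint. The sketch below only assembles such a request body; the input names (`audio`, `translate`) and the endpoint shape are assumptions for illustration, not the model's confirmed schema.

```python
import json

def build_request(audio_url, translate=False):
    """Assemble a hypothetical prediction request body for a Whisper run.

    Input field names are assumed; check the model's actual input schema.
    """
    return json.dumps({
        "input": {
            "audio": audio_url,        # URL or path of the audio to transcribe
            "translate": translate,    # translate speech to English instead of transcribing
        }
    })
```

The resulting JSON would then be POSTed to the predictions endpoint with an API token, e.g. via an HTTP client or the official Python client.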


[Model card]

Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multi-task model that can perform multilingual speech recognition as well as speech translation and language identification.



A Transformer sequence-to-sequence model is trained on various speech processing tasks, including multilingual speech recognition, speech translation, spoken language identification, and voice activity detection. All of these tasks are jointly represented as a sequence of tokens to be predicted by the decoder, allowing for a single model to replace many different stages of a traditional speech processing pipeline. The multitask training format uses a set of special tokens that serve as task specifiers or classification targets.
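The multitask format above can be sketched as a plain string of special tokens prefixed to the decoder's output. The token names (`<|startoftranscript|>`, language tags, `<|transcribe|>`/`<|translate|>`, `<|notimestamps|>`) follow the Whisper repository; the helper itself is illustrative, not part of the Whisper API.

```python
def decoder_prefix(language="en", task="transcribe", timestamps=True):
    """Build the special-token prefix that tells the decoder which task to perform."""
    tokens = ["<|startoftranscript|>"]
    if language is not None:
        tokens.append(f"<|{language}|>")  # spoken-language tag, e.g. <|en|>, <|fr|>
    tokens.append(f"<|{task}|>")          # "transcribe" or "translate"
    if not timestamps:
        tokens.append("<|notimestamps|>") # suppress timestamp prediction
    return "".join(tokens)

# French speech translated to English text:
# decoder_prefix(language="fr", task="translate")
# → "<|startoftranscript|><|fr|><|translate|>"
```

This is how one model replaces several pipeline stages: swapping a single task token changes what the decoder predicts, with no change to the model itself.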


The code and the model weights of Whisper are released under the MIT License. See https://github.com/openai/whisper/blob/main/LICENSE for further details.