Whisper is a general-purpose speech transcription model. It is trained on a large dataset of diverse audio and is a multi-task model that can perform multilingual speech transcription, speech translation, and language identification.

We’ve created a version of Whisper that runs only the most recent Whisper model, large-v2. All other model sizes are still hosted in a previous version. Links to both versions are below; see the Versions page for more details.

Model Description


A Transformer sequence-to-sequence model is trained on various speech processing tasks, including multilingual speech recognition, speech translation, spoken language identification, and voice activity detection. All of these tasks are jointly represented as a sequence of tokens to be predicted by the decoder, allowing for a single model to replace many different stages of a traditional speech processing pipeline. The multitask training format uses a set of special tokens that serve as task specifiers or classification targets.
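
To make the multitask format more concrete, here is a minimal sketch of how the special tokens could be assembled into the prefix that conditions the decoder. The function name `build_decoder_prompt` and its parameters are hypothetical and are not part of the Whisper package; the token spellings follow the format described in the Whisper paper. The real tokenizer maps these strings to integer ids, which this sketch leaves out.

```python
from typing import List, Optional


def build_decoder_prompt(
    language: Optional[str],
    task: str = "transcribe",
    timestamps: bool = False,
) -> List[str]:
    """Return the special-token prefix for one decoding request (illustrative only)."""
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unknown task: {task!r}")

    # Every sequence starts with the same marker token.
    tokens = ["<|startoftranscript|>"]

    if language is None:
        # Language identification: stop here so that the next token the
        # decoder predicts is the language token (a classification target).
        return tokens

    # Language token, e.g. <|en|> or <|fr|>.
    tokens.append(f"<|{language}|>")

    # Task specifier: transcribe in the spoken language, or translate to English.
    tokens.append(f"<|{task}|>")

    # Without timestamps, an extra token tells the decoder to emit text only.
    if not timestamps:
        tokens.append("<|notimestamps|>")

    return tokens


if __name__ == "__main__":
    print(build_decoder_prompt("en"))
    print(build_decoder_prompt("fr", task="translate", timestamps=True))
    print(build_decoder_prompt(None))
```

Because the task is selected entirely by this prefix, the same weights can serve transcription, translation, and language identification without separate model components for each.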

The code and the model weights of Whisper are released under the MIT License. See the license text for further details.


