Whisper Large-v3

Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multi-task model that can perform multilingual speech recognition, translation, and language identification.

This version runs only the most recent Whisper model, large-v3. It’s optimized for high performance and simplicity.
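As a rough illustration of what transcription with large-v3 looks like, here is a minimal sketch. It assumes the open-source `openai-whisper` Python package, not this deployment's own interface, which is not described here. The helper name `transcribe_file` is hypothetical.

```python
def transcribe_file(path: str, model_name: str = "large-v3") -> str:
    """Transcribe one audio file and return the recognized text."""
    # Imported lazily so that defining this helper does not require the package.
    # Assumption: the package was installed with `pip install -U openai-whisper`.
    import whisper

    model = whisper.load_model(model_name)  # downloads the weights on first use
    result = model.transcribe(path)         # dict with "text", "segments", "language"
    return result["text"]
```

Calling `transcribe_file("audio.mp3")` (a hypothetical file name) would return the transcript as a single string. The detected language is available under the `"language"` key of the same result dictionary.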

Model Versions

Model Size    Version
large-v3      link
large-v2      link
all others    link

While this implementation uses only the large-v3 model, we maintain links to previous versions for reference.

If you need a different model size, see our multi-model version.

Model Description


Whisper uses a Transformer sequence-to-sequence model trained on various speech processing tasks, including multilingual speech recognition, speech translation, spoken language identification, and voice activity detection. All of these tasks are jointly represented as a sequence of tokens to be predicted by the decoder, allowing for a single model to replace many different stages of a traditional speech processing pipeline.
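To make the "tasks as tokens" idea concrete, the sketch below builds the short prefix of special tokens that conditions the decoder. The token strings follow Whisper's published multitask format. The helper `build_decoder_prompt` is a hypothetical name used only for illustration, and it works on token strings, not the integer IDs a real tokenizer would produce.

```python
from typing import List


def build_decoder_prompt(language: str,
                         task: str = "transcribe",
                         timestamps: bool = False) -> List[str]:
    """Return the special-token prefix that tells the decoder what to do."""
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unsupported task: {task!r}")

    tokens = ["<|startoftranscript|>"]  # marks the start of decoding
    tokens.append(f"<|{language}|>")    # spoken language, e.g. <|en|>
    tokens.append(f"<|{task}|>")        # transcribe, or translate to English
    if not timestamps:
        tokens.append("<|notimestamps|>")  # ask for text without timestamp tokens
    return tokens


print(build_decoder_prompt("fr", task="translate"))
```

Changing only this prefix switches the same model between transcription and translation. Language identification fits the same scheme: the model predicts the language token itself right after `<|startoftranscript|>`.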

[Blog] [Paper] [Model card]


The code and model weights of Whisper are released under the MIT License. See LICENSE for further details.


