# Whisper w/ Lazy Loading
Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multi-task model that can perform multilingual speech recognition, translation, and language identification.
This version lets users choose between different model sizes, loading the selected model on demand, which offers flexibility for various use cases.
## Model Versions
| Model Size | Description |
|---|---|
| tiny | Fastest, lowest accuracy |
| base | Fast, lower accuracy |
| small | Balanced speed and accuracy |
| medium | Slower, higher accuracy |
| large-v3 | Slowest, highest accuracy |
For the specific version using only the large-v3 model, check out our single-model version.
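Lazy loading here means the chosen model is only loaded into memory the first time it is requested, then cached for reuse, so a deployment that serves multiple sizes does not pay the load cost on every request. The following is a minimal sketch of that pattern, not this repository's actual implementation; the `load_model` call is stubbed out (a real deployment would call something like `whisper.load_model(size)`):

```python
from functools import lru_cache

MODEL_SIZES = ("tiny", "base", "small", "medium", "large-v3")


@lru_cache(maxsize=None)
def get_model(size: str):
    """Load the requested Whisper model size once and cache it."""
    if size not in MODEL_SIZES:
        raise ValueError(f"unknown model size: {size}")
    # Placeholder standing in for e.g. whisper.load_model(size),
    # so the sketch stays self-contained.
    return f"<loaded {size} model>"
```

Because of the cache, repeated calls with the same size return the same model object; only the first call incurs the load.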
## Model Description

Whisper uses a Transformer sequence-to-sequence model trained on various speech processing tasks, including multilingual speech recognition, speech translation, spoken language identification, and voice activity detection. All of these tasks are jointly represented as a sequence of tokens to be predicted by the decoder, allowing for a single model to replace many different stages of a traditional speech processing pipeline.
## License
The code and model weights of Whisper are released under the MIT License. See LICENSE for further details.
## Citation

```bibtex
@misc{https://doi.org/10.48550/arxiv.2212.04356,
  doi       = {10.48550/ARXIV.2212.04356},
  url       = {https://arxiv.org/abs/2212.04356},
  author    = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  title     = {Robust Speech Recognition via Large-Scale Weak Supervision},
  publisher = {arXiv},
  year      = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}
```