Whisper w/ Lazy Loading
Whisper is a general-purpose speech recognition model. Trained on a large dataset of diverse audio, it is a multi-task model that can perform multilingual speech recognition, speech translation, and language identification.
This version lets you choose among several model sizes, so you can trade speed for accuracy to suit your use case.
Model Versions
| Model Size | Description | 
|---|---|
| tiny | Fastest, lowest accuracy | 
| base | Fast, lower accuracy | 
| small | Balanced speed and accuracy | 
| medium | Slower, higher accuracy | 
| large-v3 | Slowest, highest accuracy | 
If you only need the large-v3 model, see our single-model version.
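As a rough illustration of how the sizes in the table above might be loaded on demand, here is a minimal sketch of a lazy cache. It is not this repository's actual implementation: `LazyModelCache`, `MODEL_SIZES`, and `_default_loader` are hypothetical names, and the default loader assumes the `openai-whisper` package, whose `whisper.load_model` function accepts these size names.

```python
from typing import Any, Callable, Dict, List

# Sizes taken from the table above.
MODEL_SIZES = ("tiny", "base", "small", "medium", "large-v3")


def _default_loader(size: str) -> Any:
    # Imported inside the function so the package is only required
    # when a model is actually requested (assumes openai-whisper is installed).
    import whisper
    return whisper.load_model(size)


class LazyModelCache:
    """Loads each model size on first request and reuses it afterwards."""

    def __init__(self, loader: Callable[[str], Any] = _default_loader) -> None:
        self._loader = loader
        self._models: Dict[str, Any] = {}

    def get(self, size: str) -> Any:
        if size not in MODEL_SIZES:
            raise ValueError(
                f"unknown model size {size!r}; expected one of {MODEL_SIZES}"
            )
        # Load only if this size has not been requested before.
        if size not in self._models:
            self._models[size] = self._loader(size)
        return self._models[size]

    def loaded_sizes(self) -> List[str]:
        # Names of the sizes currently held in memory, sorted alphabetically.
        return sorted(self._models)
```

With the default loader, something like `LazyModelCache().get("small").transcribe("audio.wav")` would load the small model on first use and transcribe a (hypothetical) local file. Sizes that are never requested are never loaded, which keeps memory use proportional to what is actually used.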
Model Description

Whisper uses a Transformer sequence-to-sequence model trained on various speech processing tasks, including multilingual speech recognition, speech translation, spoken language identification, and voice activity detection. All of these tasks are jointly represented as a sequence of tokens to be predicted by the decoder, allowing for a single model to replace many different stages of a traditional speech processing pipeline.
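To make the idea of tasks as token sequences concrete, the sketch below builds the kind of special-token prefix that tells the decoder which language and task to handle. The token spellings follow the format described in the Whisper paper; the helper `decoder_prefix` is a hypothetical name used only for illustration, and it produces strings rather than the integer token IDs a real tokenizer would emit.

```python
from typing import List


def decoder_prefix(language: str, task: str = "transcribe",
                   timestamps: bool = False) -> List[str]:
    """Return the special tokens that specify language and task for the decoder."""
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unsupported task {task!r}")
    tokens = [
        "<|startoftranscript|>",  # marks the beginning of a prediction
        f"<|{language}|>",        # language code, e.g. "en" or "fr"
        f"<|{task}|>",            # transcribe in-language, or translate to English
    ]
    if not timestamps:
        # Ask the decoder to emit text only, without timestamp tokens.
        tokens.append("<|notimestamps|>")
    return tokens


print(decoder_prefix("fr", task="translate"))
```

Because the language and task are just tokens in the sequence, switching from transcription to translation changes one token in the prefix rather than swapping in a different model.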
License
The code and model weights of Whisper are released under the MIT License. See LICENSE for further details.
Citation
@misc{https://doi.org/10.48550/arxiv.2212.04356,
  doi = {10.48550/ARXIV.2212.04356},
  url = {https://arxiv.org/abs/2212.04356},
  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}
