# Whisper w/ Lazy Loading
Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multi-task model that can perform multilingual speech recognition, translation, and language identification.
This version lets users choose between different model sizes at request time, trading speed against accuracy for various use cases.
## Model Versions
| Model Size | Description |
|---|---|
| tiny | Fastest, lowest accuracy |
| base | Fast, lower accuracy |
| small | Balanced speed and accuracy |
| medium | Slower, higher accuracy |
| large-v3 | Slowest, highest accuracy |
For the specific version using only the large-v3 model, check out our single-model version.
## Model Description
Whisper uses a Transformer sequence-to-sequence model trained on various speech processing tasks, including multilingual speech recognition, speech translation, spoken language identification, and voice activity detection. All of these tasks are jointly represented as a sequence of tokens to be predicted by the decoder, allowing for a single model to replace many different stages of a traditional speech processing pipeline.
## License
The code and model weights of Whisper are released under the MIT License. See LICENSE for further details.
## Citation
```bibtex
@misc{https://doi.org/10.48550/arxiv.2212.04356,
  doi       = {10.48550/ARXIV.2212.04356},
  url       = {https://arxiv.org/abs/2212.04356},
  author    = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  title     = {Robust Speech Recognition via Large-Scale Weak Supervision},
  publisher = {arXiv},
  year      = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}
```