Whisper with Lazy Loading

Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multi-task model that can perform multilingual speech recognition, translation, and language identification.

This version lets users choose among five model sizes, from tiny to large-v3, trading speed against accuracy to suit the use case.

Model Versions

Model Size	Description
tiny	Fastest, lowest accuracy
base	Fast, lower accuracy
small	Balanced speed and accuracy
medium	Slower, higher accuracy
large-v3	Slowest, highest accuracy

For a version that uses only the large-v3 model, see our single-model version.
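
The lazy loading named in the title is not documented on this page, so the following is only a minimal sketch of one way it could work: a checkpoint is loaded the first time its size is requested and reused afterwards. The `LazyModels` class and `whisper_loader` function are hypothetical names introduced for illustration. The sketch assumes the open-source `openai-whisper` package, whose `whisper.load_model(name)` call accepts the size names in the table above.

```python
from typing import Any, Callable, Dict, List

# Sizes listed in the Model Versions table.
MODEL_SIZES = ("tiny", "base", "small", "medium", "large-v3")


def whisper_loader(size: str) -> Any:
    """Load a checkpoint with the openai-whisper package (assumed installed)."""
    import whisper  # imported here so nothing heavy happens until a model is needed

    return whisper.load_model(size)


class LazyModels:
    """Hypothetical helper: loads each size on first request, then reuses it."""

    def __init__(self, loader: Callable[[str], Any] = whisper_loader) -> None:
        self._loader = loader
        self._loaded: Dict[str, Any] = {}

    def get(self, size: str) -> Any:
        if size not in MODEL_SIZES:
            raise ValueError(f"unknown model size {size!r}; expected one of {MODEL_SIZES}")
        if size not in self._loaded:
            # First request for this size: pay the loading cost once.
            self._loaded[size] = self._loader(size)
        return self._loaded[size]

    def loaded_sizes(self) -> List[str]:
        """Return the sizes currently held in memory, sorted by name."""
        return sorted(self._loaded)
```

With `openai-whisper` installed and an audio file at hand, `LazyModels().get("base").transcribe("audio.mp3")["text"]` would load only the base checkpoint and return the transcript. Taking the loader as a parameter keeps the caching logic testable without downloading any weights.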

Model Description

Approach

Whisper uses a Transformer sequence-to-sequence model trained on various speech processing tasks, including multilingual speech recognition, speech translation, spoken language identification, and voice activity detection. All of these tasks are jointly represented as a sequence of tokens to be predicted by the decoder, allowing for a single model to replace many different stages of a traditional speech processing pipeline.
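
To make the joint token representation concrete, the sketch below builds the special-token prefix that tells the decoder which task to perform. The token spellings follow the Whisper paper; the `decoder_prefix` function is a hypothetical helper that works on token strings rather than the integer IDs the real tokenizer produces, so treat it as an illustration and not as the library's API.

```python
from typing import List, Optional


def decoder_prefix(
    language: Optional[str],
    task: str = "transcribe",
    timestamps: bool = False,
) -> List[str]:
    """Return the special tokens that condition the decoder on a task.

    language: a language code such as "en", or None to leave the language
              token for the model to predict (language identification).
    task:     "transcribe" (same-language text) or "translate" (to English).
    """
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unknown task {task!r}")

    tokens = ["<|startoftranscript|>"]
    if language is None:
        # The model's next prediction is the language token itself.
        return tokens

    tokens.append(f"<|{language}|>")
    tokens.append(f"<|{task}|>")
    if not timestamps:
        tokens.append("<|notimestamps|>")
    return tokens


print(decoder_prefix("en"))
print(decoder_prefix("fr", task="translate", timestamps=True))
```

Because the task is selected by a few tokens at the start of the sequence, the same decoder weights serve recognition, translation, and language identification.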

[Blog] [Paper] [Model card]

License

The code and model weights of Whisper are released under the MIT License. See LICENSE for further details.

Citation

@misc{https://doi.org/10.48550/arxiv.2212.04356,
  doi = {10.48550/ARXIV.2212.04356},
  url = {https://arxiv.org/abs/2212.04356},
  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}
