openai/ whisper

Convert speech in audio to text

Run time and cost

This model runs on Nvidia T4 (High-memory) GPU hardware. Predictions typically complete within 4 seconds. The predict time for this model varies significantly based on the inputs.

Readme

Whisper is a general-purpose speech transcription model. It is trained on a large dataset of diverse audio and is also a multi-task model that can perform multilingual speech transcription as well as speech translation and language identification.

We’ve created a version of Whisper which only runs the most recent Whisper model, large-v2. We still host all other model sizes in a previous version. Links to both versions are below, check out more details on the Versions page.

Model Size Version
large-v2 link
all others link

Model Description

Approach

A Transformer sequence-to-sequence model is trained on various speech processing tasks, including multilingual speech recognition, speech translation, spoken language identification, and voice activity detection. All of these tasks are jointly represented as a sequence of tokens to be predicted by the decoder, allowing for a single model to replace many different stages of a traditional speech processing pipeline. The multitask training format uses a set of special tokens that serve as task specifiers or classification targets.

[Blog] [Paper] [Model card]

License

The code and the model weights of Whisper are released under the MIT License. See https://github.com/openai/whisper/blob/main/LICENSE for further details.

Citation

@misc{https://doi.org/10.48550/arxiv.2212.04356,
  doi = {10.48550/ARXIV.2212.04356},
  url = {https://arxiv.org/abs/2212.04356},
  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  keywords = {Audio and Speech Processing (eess.AS), Computation and Language (cs.CL), Machine Learning (cs.LG), Sound (cs.SD), FOS: Electrical engineering, electronic engineering, information engineering, FOS: Electrical engineering, electronic engineering, information engineering, FOS: Computer and information sciences, FOS: Computer and information sciences},
  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}