openai / whisper

Convert speech in audio to text

Demo API Examples Versions (91ee9c0c)

Run time and cost

Predictions run on Nvidia T4 GPU hardware. Predictions typically complete within 55 seconds. The predict time for this model varies significantly based on the inputs.

Whisper is a general-purpose speech transcription model. It is trained on a large dataset of diverse audio and is also a multi-task model that can perform multilingual speech transcription as well as speech translation and language identification.

We’ve created a version of Whisper which only runs the most recent Whisper model, large-v2. We still host all other model sizes in a previous version. Links to both versions are below, check out more details on the Versions page.

Model Size Version
large-v2 link
all others link

Model Description

Approach

A Transformer sequence-to-sequence model is trained on various speech processing tasks, including multilingual speech recognition, speech translation, spoken language identification, and voice activity detection. All of these tasks are jointly represented as a sequence of tokens to be predicted by the decoder, allowing for a single model to replace many different stages of a traditional speech processing pipeline. The multitask training format uses a set of special tokens that serve as task specifiers or classification targets.

[Blog] [Paper] [Model card]

License

The code and the model weights of Whisper are released under the MIT License. See https://github.com/openai/whisper/blob/main/LICENSE for further details.

Citation

@misc{https://doi.org/10.48550/arxiv.2212.04356,
  doi = {10.48550/ARXIV.2212.04356},
  url = {https://arxiv.org/abs/2212.04356},
  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  keywords = {Audio and Speech Processing (eess.AS), Computation and Language (cs.CL), Machine Learning (cs.LG), Sound (cs.SD), FOS: Electrical engineering, electronic engineering, information engineering, FOS: Electrical engineering, electronic engineering, information engineering, FOS: Computer and information sciences, FOS: Computer and information sciences},
  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}