openai/whisper

Public
Transcribe speech with openai/whisper
85.2K runs

Run time and cost

Predictions run on Nvidia T4 GPU hardware and typically complete within 5 minutes, though predict time varies significantly with the input audio.

This is a Cog implementation of https://github.com/openai/whisper
The code for the demo is at https://github.com/chenxwh/cog-whisper

Whisper

[Blog]
[Paper]
[Model card]

Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multi-task model that can perform multilingual speech recognition as well as speech translation and language identification.
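As a sketch of how one might call this model through the Replicate Python client, the helper below assembles an input payload. The field names (`audio`, `model`, `translate`, `language`) are assumptions modeled on the cog-whisper demo's inputs, not a confirmed schema:

```python
from typing import Optional


def build_whisper_input(audio_path: str, model: str = "base",
                        translate: bool = False,
                        language: Optional[str] = None) -> dict:
    """Assemble an input payload for a Whisper prediction.

    The field names here (`audio`, `model`, `translate`, `language`)
    are assumptions modeled on the cog-whisper demo inputs.
    """
    payload = {"audio": audio_path, "model": model, "translate": translate}
    if language is not None:
        payload["language"] = language  # ISO code; omit to auto-detect
    return payload


# Hypothetical usage with the Replicate client (requires an API token):
# import replicate
# output = replicate.run("openai/whisper", input=build_whisper_input("speech.mp3"))

print(build_whisper_input("speech.mp3", translate=True))
```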

Approach

A Transformer sequence-to-sequence model is trained on various speech processing tasks, including multilingual speech recognition, speech translation, spoken language identification, and voice activity detection. All of these tasks are jointly represented as a sequence of tokens to be predicted by the decoder, allowing for a single model to replace many different stages of a traditional speech processing pipeline. The multitask training format uses a set of special tokens that serve as task specifiers or classification targets.
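The multitask format described above can be illustrated with a toy decoder prefix. The token spellings mirror the special tokens used in the Whisper codebase (start-of-transcript, language tag, task tag), but this is an illustrative sketch using plain strings, not the model's actual tokenizer:

```python
def multitask_prefix(language: str = "en", task: str = "transcribe",
                     timestamps: bool = False) -> list:
    """Build the decoder's special-token prefix for one task.

    Illustrative only: the special tokens select the language and
    the task (transcription vs. translation), so a single decoder
    can serve multiple stages of a speech pipeline.
    """
    if task not in ("transcribe", "translate"):
        raise ValueError("task must be 'transcribe' or 'translate'")
    prefix = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        prefix.append("<|notimestamps|>")
    return prefix


print(multitask_prefix())                    # English transcription
print(multitask_prefix("fr", "translate"))   # French speech -> English text
```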

License

The code and the model weights of Whisper are released under the MIT License. See https://github.com/openai/whisper/blob/main/LICENSE for further details.
