Model details
Whisper is a Transformer-based encoder-decoder model, also referred to as a sequence-to-sequence model.
It was trained on 1 million hours of weakly labeled audio and 4 million hours of pseudolabeled audio collected using Whisper large-v2.
See the full model card here