collectiveai-team / crisperwhisper

Unofficial implementation of Verbatim Automatic Speech Recognition with improved word-level timestamps and filler detection

CrisperWhisper

CrisperWhisper is an advanced variant of OpenAI’s Whisper, designed for fast, precise, and verbatim speech recognition with accurate (crisp) word-level timestamps. Unlike the original Whisper, which tends to omit disfluencies and follows more of an intended transcription style, CrisperWhisper aims to transcribe every spoken word exactly as it is, including fillers, pauses, stutters, and false starts.

Key Features

  • 🎯 Accurate Word-Level Timestamps: Provides precise timestamps, even around disfluencies and pauses, by utilizing an adjusted tokenizer and a custom attention loss during training.
  • 📝 Verbatim Transcription: Transcribes every spoken word exactly as it is, including fillers, and differentiates between fillers such as “um” and “uh”.
  • 🔍 Filler Detection: Detects and accurately transcribes fillers.
  • 🛡️ Hallucination Mitigation: Minimizes transcription hallucinations to enhance accuracy.
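
To illustrate what verbatim output with word-level timestamps makes possible downstream, here is a minimal post-processing sketch. It does not call the model. The word records, their `text`/`start`/`end` fields, the filler spellings, and the helper names `find_fillers` and `find_pauses` are all assumptions made for illustration; the actual output format of this model may differ.

```python
from typing import List, Tuple, TypedDict


class Word(TypedDict):
    """Hypothetical word record: text plus start/end times in seconds."""
    text: str
    start: float
    end: float


# Assumed filler spellings; adjust to whatever the model actually emits.
FILLERS = {"um", "uh"}


def find_fillers(words: List[Word]) -> List[Word]:
    """Return the words whose text, ignoring case and punctuation, is a filler."""
    return [
        w for w in words
        if w["text"].strip().lower().strip(".,!?") in FILLERS
    ]


def find_pauses(words: List[Word], min_gap: float = 0.5) -> List[Tuple[float, float]]:
    """Return (pause_start, pause_end) for gaps of at least min_gap seconds."""
    pauses = []
    for prev, nxt in zip(words, words[1:]):
        if nxt["start"] - prev["end"] >= min_gap:
            pauses.append((prev["end"], nxt["start"]))
    return pauses


# Made-up sample data containing fillers, a repeated word, and one long pause.
words: List[Word] = [
    {"text": "So", "start": 0.00, "end": 0.20},
    {"text": "um", "start": 0.30, "end": 0.55},
    {"text": "I", "start": 1.40, "end": 1.50},
    {"text": "I", "start": 1.55, "end": 1.65},
    {"text": "think", "start": 1.70, "end": 2.00},
    {"text": "uh,", "start": 2.05, "end": 2.30},
    {"text": "yes", "start": 2.40, "end": 2.70},
]

for w in find_fillers(words):
    print(f"filler {w['text']!r} at {w['start']:.2f}s")
for start, end in find_pauses(words):
    print(f"pause from {start:.2f}s to {end:.2f}s")
```

Because fillers and pauses are preserved with their own timestamps, analyses like this reduce to simple list operations; a transcript in an intended (cleaned-up) style would not contain the information needed.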