collectiveai-team / crisperwhisper

Unofficial implementation of Verbatim Automatic Speech Recognition with improved word-level timestamps and filler detection

CrisperWhisper

CrisperWhisper is an advanced variant of OpenAI’s Whisper, designed for fast, precise, and verbatim speech recognition with accurate (crisp) word-level timestamps. Unlike the original Whisper, which tends to omit disfluencies and follows more of an intended transcription style, CrisperWhisper aims to transcribe every word exactly as it was spoken, including fillers, pauses, stutters and false starts.

Key Features

  • 🎯 Accurate Word-Level Timestamps: Provides precise timestamps, even around disfluencies and pauses, by utilizing an adjusted tokenizer and a custom attention loss during training.
  • 📝 Verbatim Transcription: Transcribes every word exactly as it was spoken, including fillers such as “um” and “uh”, which it distinguishes from one another.
  • 🔍 Filler Detection: Detects and accurately transcribes fillers.
  • 🛡️ Hallucination Mitigation: Minimizes transcription hallucinations to enhance accuracy.
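
To illustrate what verbatim output with word-level timestamps makes possible, here is a minimal sketch of downstream post-processing. It does not call the model. The `Word` record shape (`text`, `start`, `end` in seconds), the helper names `find_fillers` and `find_pauses`, the filler spellings in `FILLERS`, and the sample data are all assumptions made for illustration, not the model's documented output format.

```python
from typing import List, Tuple, TypedDict


class Word(TypedDict):
    # Hypothetical record for one transcribed word (times in seconds).
    text: str
    start: float
    end: float


# Assumption: fillers appear in the transcript as plain "um" / "uh".
FILLERS = {"um", "uh"}


def find_fillers(words: List[Word]) -> List[Word]:
    """Return the words whose text, ignoring case and punctuation, is a filler."""
    return [w for w in words if w["text"].strip(".,!?").lower() in FILLERS]


def find_pauses(words: List[Word], min_gap: float = 0.5) -> List[Tuple[float, float]]:
    """Return (start, end) spans of silence of at least min_gap seconds between words."""
    pauses = []
    for prev, nxt in zip(words, words[1:]):
        if nxt["start"] - prev["end"] >= min_gap:
            pauses.append((prev["end"], nxt["start"]))
    return pauses


# Made-up sample: a filler, a pause, and a repeated "I" (a stutter).
words: List[Word] = [
    {"text": "So", "start": 0.00, "end": 0.20},
    {"text": "um", "start": 0.35, "end": 0.60},
    {"text": "I", "start": 1.40, "end": 1.50},
    {"text": "I", "start": 1.55, "end": 1.65},
    {"text": "think", "start": 1.70, "end": 2.00},
]

fillers = find_fillers(words)
pauses = find_pauses(words)
```

Because the timestamps are per word, pauses fall out as gaps between consecutive words. A transcript that drops disfluencies would not expose the filler or the repeated word to this kind of analysis.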