thomasmol/whisper-diarization

Transcribes any audio file (base64, url) with speaker diarization. *Please read instructions below*

Run time and cost

Predictions run on Nvidia T4 GPU hardware. Predictions typically complete within 8 minutes. The predict time for this model varies significantly based on the inputs.

Transcribe any audio file with speaker diarization

Easily create transcriptions with speaker labels and timestamps (diarization) with this model. It uses whisper and pyannote under the hood. I will continue improving performance and accuracy.

How to use 🪄

  • input a file as a base64 string or a file url (must be a direct, public link to the file)
  • input the filename, including the file extension
  • provide the number of speakers
  • give a prompt to improve the accuracy of the transcript
  • the other inputs are only used if you provide chunks of files
  • hit submit and wait!
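
The steps above can be sketched in Python as a small helper that assembles the inputs. This is only an illustration: the key names (`file_url`, `file_string`, `file_name`, `num_speakers`, `prompt`) and the example URL are assumptions, not confirmed by this page, so check the model's API tab for the real input names before using it.

```python
import base64
from pathlib import Path


def build_input(source, filename, num_speakers, prompt=""):
    """Assemble a model input dict from a URL or a local file path.

    NOTE: every key name below is hypothetical; the real input names
    are listed on the model's API tab.
    """
    payload = {
        "file_name": filename,         # filename including its extension
        "num_speakers": num_speakers,  # how many speakers to label
        "prompt": prompt,              # optional hint to improve accuracy
    }
    if source.startswith(("http://", "https://")):
        # Must be a direct, public link to the file.
        payload["file_url"] = source
    else:
        # Local file: send its bytes as a base64 string instead.
        raw = Path(source).read_bytes()
        payload["file_string"] = base64.b64encode(raw).decode("ascii")
    return payload


example = build_input(
    "https://example.com/interview.mp3",  # placeholder URL
    "interview.mp3",
    num_speakers=2,
    prompt="Interview about GPU hardware",
)
```

A dict like `example` would then be submitted as the prediction input, either through the web form fields or through an API client.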

Need an easier interface to use this model?

Head over to 🎙️ Audiogest, which is a webapp I made that uses this model. In the app you can upload any audio file, get the transcription produced by this model, and generate useful summaries!

No file urls or base64 strings needed!

Or support me here: Buy me a coffee. And support these fantastic developers and researchers 🙏:

Model description

Uses faster-whisper for transcription and pyannote's speaker embedding with the "speechbrain/spkrec-ecapa-voxceleb" model for speaker diarization.
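
To give a feel for how speaker embeddings can become speaker labels, here is a minimal sketch. It assumes each transcribed segment already has an embedding vector and groups them with average-linkage agglomerative clustering on cosine distance. Whether this model clusters in exactly this way is an assumption; the function names, the toy 2-D embeddings, and the `SPEAKER_00` label format are made up for illustration (real embeddings have many more dimensions).

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def cluster_speakers(embeddings, num_speakers):
    """Merge the two closest clusters until num_speakers clusters remain."""
    clusters = [[i] for i in range(len(embeddings))]
    while len(clusters) > num_speakers:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average pairwise distance between the two clusters.
                dists = [
                    cosine_distance(embeddings[a], embeddings[b])
                    for a in clusters[i]
                    for b in clusters[j]
                ]
                avg = sum(dists) / len(dists)
                if best is None or avg < best[0]:
                    best = (avg, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))  # j > i, so index i stays valid
    labels = [None] * len(embeddings)
    for speaker_idx, members in enumerate(clusters):
        for m in members:
            labels[m] = f"SPEAKER_{speaker_idx:02d}"
    return labels


# Toy transcript segments with made-up embeddings.
segments = [
    {"start": 0.0, "end": 2.1, "text": "Hi, how are you?"},
    {"start": 2.1, "end": 4.0, "text": "Fine, thanks."},
    {"start": 4.0, "end": 6.5, "text": "Good to hear."},
]
toy_embeddings = [[1.0, 0.1], [0.1, 1.0], [0.9, 0.2]]

labels = cluster_speakers(toy_embeddings, num_speakers=2)
for seg, label in zip(segments, labels):
    seg["speaker"] = label
```

In this toy run the first and third segments end up with the same speaker label because their embeddings point in nearly the same direction. It also shows why providing the number of speakers helps: it tells the clustering when to stop merging.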

Input is the base64 string of an audio file or a file url.

Intended use

Easily transcribe audio in any format and get speaker labels.

Ethical considerations

🤷 Same as any AI model. Your input is not used for fine-tuning.

Caveats and recommendations

Predictions can take a long time, even for short audio clips, because of a possible cold boot.

Speed recently improved by 4x by using faster-whisper.

Diarization is not perfect.