Collections

Speaker diarization

Identify who is speaking in your audio or video files and turn messy conversations into structured timelines.

These models can separate speakers, generate transcripts, add timestamps, handle overlapping dialogue, and work with long recordings.

They are useful for meetings, interviews, podcasts, call centers, and any workflow that needs reliable speaker labeling.

If you are interested in models that transcribe speech to text, check out our Transcribe Speech to Text collection.

Recommended Models

Frequently asked questions

Which models are the fastest for diarization and transcription?

If you want the fastest possible transcription, use vaibhavs10/incredibly-fast-whisper. It processes long audio extremely quickly while staying cost-efficient.
If you need fast transcription plus speaker labels and word-level timestamps, choose victor-upmeet/whisperx, which adds diarization and alignment with only a small slowdown.
For difficult audio with several speakers or noisy rooms, rafaelgalle/whisper-diarization-advanced is fast while providing extra filtering and cleanup tools.

Which models give the best balance between cost and accuracy?

A strong balance of speed, cost, and accuracy comes from thomasmol/whisper-diarization, which uses Whisper Large with a reliable diarization component.
If you mainly need speaker segmentation without transcription, meronym/speaker-diarization is extremely inexpensive and very efficient.
For premium accuracy across many languages and advanced features such as custom vocabulary and translation, consider sabuhigr/sabuhi-model.

What works best for real-time or near-real-time usage in meetings or calls?

For quick turnarounds, start with vaibhavs10/incredibly-fast-whisper or victor-upmeet/whisperx, since they can process long audio in seconds.
If you have overlapping speakers or low-quality conference calls, rafaelgalle/whisper-diarization-advanced handles noisy scenarios more gracefully.
If you only need speaker timing and not full text, lucataco/speaker-diarization or collectiveai-team/speaker-diarization-3 can return segment boundaries very fast.

What should I use for multilingual audio, heavy noise, or technical vocabulary?

If your audio includes multiple languages or domain-specific words, sabuhigr/sabuhi-model supports translation, multilingual transcription, and custom vocabulary.
For English audio with noise, echo, or inconsistent volume, rafaelgalle/whisper-diarization-advanced gives you detailed control over filters and preprocessing.
For clean single-language audio, thomasmol/whisper-diarization or vaibhavs10/incredibly-fast-whisper will usually perform well.

What is the difference between diarization-only models and transcription-plus-diarization models?

Diarization-only models such as lucataco/speaker-diarization, collectiveai-team/speaker-diarization-3, and meronym/speaker-diarization focus on identifying speaker changes and labeling segments without generating text.
Transcription-plus-diarization models such as thomasmol/whisper-diarization, victor-upmeet/whisperx, and sabuhigr/sabuhi-model generate a transcript along with speaker labels and timestamps.
Choose a diarization-only model when you just need speaker boundaries, and a combined model when you need the full text for every speaker.

What kinds of outputs should I expect from these models?

Diarization-only models like meronym/speaker-diarization output a list of speaker segments with start and stop times. Some also include speaker embeddings that can be used to match voices across recordings.
Transcription-plus-diarization models like victor-upmeet/whisperx output transcripts with speaker labels and timestamps, often down to the word.
You can expect JSON, text transcripts, or structured segment lists depending on the model.
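The exact output schema varies from model to model, so always check the output example on the model's page. As a hypothetical sketch (the field names here are assumed, not taken from any one model), a diarization-only result can be parsed into a readable timeline like this:

```python
import json

# Hypothetical diarization output: field names vary by model, but most
# return per-segment speaker labels with start/stop times in seconds.
raw = json.dumps({
    "segments": [
        {"speaker": "SPEAKER_00", "start": 0.0, "stop": 4.2},
        {"speaker": "SPEAKER_01", "start": 4.2, "stop": 9.8},
        {"speaker": "SPEAKER_00", "start": 9.8, "stop": 12.5},
    ]
})

def to_timeline(output_json: str) -> list[str]:
    """Render diarization segments as human-readable timeline lines."""
    segments = json.loads(output_json)["segments"]
    return [
        f"[{s['start']:7.2f} - {s['stop']:7.2f}] {s['speaker']}"
        for s in segments
    ]

for line in to_timeline(raw):
    print(line)
```

Transcription-plus-diarization models add a text field per segment (and often per word), but the same parse-then-iterate pattern applies.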

How can I self-host a diarization or transcription model, or push one to Replicate?

Models such as thomasmol/whisper-diarization and meronym/speaker-transcription are open source and can be self-hosted with Docker or Cog.
To publish your own model on Replicate, define it with Cog: write a cog.yaml that describes your environment, declare your inputs and outputs in a predictor, then push the repository with cog push. Replicate will run it on managed GPUs with no extra setup.
If your model uses external diarization components such as Pyannote, make sure to include any required access tokens and licensing notes.
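A rough sketch of the Cog configuration might look like the following. The package versions are illustrative, not prescriptive; pin whatever your model actually needs:

```yaml
# cog.yaml -- illustrative only; pin versions to match your model
build:
  gpu: true
  python_version: "3.11"
  python_packages:
    - "torch==2.1.0"
    - "pyannote.audio==3.1.1"
predict: "predict.py:Predictor"
```

After cog login, a command like cog push r8.im/your-username/your-model builds the image and uploads it to Replicate.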

Can I use these models for commercial work?

Yes, most models allow commercial use as long as you follow their licenses. Check the licensing sections on each model page such as sabuhigr/sabuhi-model or victor-upmeet/whisperx.
Be mindful of privacy and consent when processing user-generated audio.
If you plan to use diarization embeddings or store speaker profiles, check whether the model license or your use case has extra requirements.

How do I run these models in practice?

Upload your audio or provide a URL, choose parameters such as number of speakers, translation, or noise filtering, then run the model.
Use vaibhavs10/incredibly-fast-whisper if you need fast transcription, victor-upmeet/whisperx if you need timestamps and speaker labels, and rafaelgalle/whisper-diarization-advanced if you need advanced audio cleaning.
If you only need speaker boundaries, meronym/speaker-diarization or lucataco/speaker-diarization will work well.

What should I know before running a job in this category?

Make sure your audio format is supported.
If you have a good guess about speaker count, set minimum and maximum speakers for models like thomasmol/whisper-diarization.
For noisy recordings, models with preprocessing such as rafaelgalle/whisper-diarization-advanced or sabuhigr/sabuhi-model will often improve accuracy.
Long audio files increase cost, especially for premium models such as awerks/whisperx. Always test a short sample first.

Any other collection specific tips or considerations?

Some workflows combine models. For example, you can diarize with meronym/speaker-diarization and then transcribe each speaker segment with vaibhavs10/incredibly-fast-whisper.
For high quality meeting transcripts with precise timings, victor-upmeet/whisperx is an excellent option.
For call center recordings or stereo interviews, rafaelgalle/whisper-diarization-advanced can treat each channel separately for better accuracy.
For multilingual recordings or translation, sabuhigr/sabuhi-model is the most flexible choice.
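The two-step combination mentioned above, diarize first and transcribe second, comes down to assigning each timestamped word to the speaker segment it falls inside. A minimal sketch, assuming the diarizer returns labeled time segments and the transcriber returns word-level timestamps (both shapes are illustrative):

```python
def assign_speakers(words, segments):
    """Label each timestamped word with the speaker segment containing it.

    words:    [{"word": str, "start": float, "end": float}, ...]
    segments: [{"speaker": str, "start": float, "end": float}, ...]
    """
    labeled = []
    for w in words:
        mid = (w["start"] + w["end"]) / 2  # midpoint tolerates slight drift
        speaker = next(
            (s["speaker"] for s in segments if s["start"] <= mid < s["end"]),
            "UNKNOWN",
        )
        labeled.append({**w, "speaker": speaker})
    return labeled

segments = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 5.0},
    {"speaker": "SPEAKER_01", "start": 5.0, "end": 10.0},
]
words = [
    {"word": "Hello", "start": 0.1, "end": 0.5},
    {"word": "there", "start": 5.2, "end": 5.6},
]
print(assign_speakers(words, segments))
```

Matching on the word midpoint rather than its start time avoids misattributing words that straddle a segment boundary.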

What if I want to build a full workflow for meetings or interviews?

You can ingest audio, apply diarization, generate transcripts, attach speaker labels, create speaker-level summaries, and export subtitle files.
For fast English workflows use meronym/speaker-transcription.
For multilingual or high accuracy workflows use sabuhigr/sabuhi-model.
For detailed word timing and speaker attribution use victor-upmeet/whisperx.
Integrate the outputs into your application so that each meeting comes back with structured timestamps, speaker identities, and complete transcripts.
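As one example of the export step, labeled segments can be turned into an SRT subtitle file with a few lines of standard-library Python. The segment shape here is illustrative; adapt the field names to whatever your chosen model returns:

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, e.g. 00:01:02,500."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments) -> str:
    """segments: [{"speaker", "start", "end", "text"}, ...] -> SRT string."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['speaker']}: {seg['text']}\n"
        )
    return "\n".join(blocks)

print(to_srt([
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 2.5,
     "text": "Welcome, everyone."},
]))
```

Writing the result to a .srt file gives you subtitles that most video players and meeting tools can load directly.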