audioscrape/whisperx

WhisperX with speaker diarization and embeddings extraction for audio transcription


Run time and cost

This model runs on NVIDIA L40S GPU hardware. We don't yet have enough runs of this model to provide performance information.

Readme

WhisperX with Speaker Embeddings

Production-ready WhisperX model for high-quality audio transcription with speaker diarization and voice embedding extraction.

Features

  • Transcription: State-of-the-art WhisperX large-v3 model
  • Speaker Diarization: PyAnnote-based multi-speaker identification
  • Voice Embeddings: 256-dimensional speaker embeddings for voice matching
  • Long-form Support: Handles audio files up to 4 hours
  • Word-level Timestamps: Precise alignment for each word

Use Cases

  • Podcast transcription with speaker identification
  • Meeting transcription with participant tracking
  • Interview analysis with speaker separation
  • Voice-based speaker matching across episodes
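
The last use case relies on comparing the 256-dimensional voice embeddings this model returns. A minimal sketch of one common approach, cosine similarity with a decision threshold; the threshold value here is illustrative, not something this model documents:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two speaker embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def same_speaker(emb_a, emb_b, threshold=0.75):
    # threshold=0.75 is an assumed example value; tune it on your own data.
    return cosine_similarity(emb_a, emb_b) >= threshold
```

Comparing an episode's embeddings against a stored set of known-speaker embeddings lets you track the same voice across recordings.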

Input Parameters

  • audio: Audio file URL or upload
  • huggingface_token: Required for speaker diarization (get one at huggingface.co; the PyAnnote diarization models are gated, so you must also accept their terms there)
  • enable_diarization: Enable speaker separation (default: true)
  • min_speakers: Minimum expected speakers (optional)
  • max_speakers: Maximum expected speakers (optional)
  • batch_size: Processing batch size (default: 8)
  • return_word_timestamps: Include word-level timing (default: true)
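
A minimal sketch of assembling these parameters for Replicate's Python client; the audio URL and token are placeholders, and the speaker counts are example hints:

```python
# Hypothetical input payload matching the parameters listed above.
input_payload = {
    "audio": "https://example.com/episode.mp3",  # placeholder URL
    "huggingface_token": "hf_xxx",               # placeholder; required for diarization
    "enable_diarization": True,
    "min_speakers": 2,       # optional hint
    "max_speakers": 4,       # optional hint
    "batch_size": 8,         # default
    "return_word_timestamps": True,
}

# With the official client, the call would be roughly:
#   import replicate
#   output = replicate.run("audioscrape/whisperx", input=input_payload)
```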

Output

Returns JSON with:

  • segments: Transcribed text with timestamps and speaker labels
  • speaker_embeddings: 256-dimensional voice embeddings for each speaker
  • metadata: Language, duration, speaker count
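
A sketch of working with a result shaped like this description; the exact field names inside each segment (e.g. "start", "end", "speaker", "text") are assumptions based on typical WhisperX-style output, not confirmed by this README:

```python
# Sample result matching the documented top-level keys; segment field
# names are assumed for illustration.
result = {
    "segments": [
        {"start": 0.0, "end": 3.2, "speaker": "SPEAKER_00", "text": "Welcome back."},
        {"start": 3.2, "end": 7.8, "speaker": "SPEAKER_01", "text": "Thanks for having me."},
        {"start": 7.8, "end": 9.5, "speaker": "SPEAKER_00", "text": "Let's dive in."},
    ],
    "metadata": {"language": "en", "duration": 9.5, "speaker_count": 2},
}

def transcript_by_speaker(segments):
    # Collect each speaker's lines in order of appearance.
    by_speaker = {}
    for seg in segments:
        by_speaker.setdefault(seg["speaker"], []).append(seg["text"])
    return by_speaker

grouped = transcript_by_speaker(result["segments"])
# grouped["SPEAKER_00"] → ["Welcome back.", "Let's dive in."]
```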