# WhisperX with Speaker Embeddings

A production-ready WhisperX model for high-quality audio transcription with speaker diarization and voice embedding extraction.
## Features
- Transcription: State-of-the-art WhisperX large-v3 model
- Speaker Diarization: PyAnnote-based multi-speaker identification
- Voice Embeddings: 256-dimensional speaker embeddings for voice matching
- Long-form Support: Handles audio files up to 4 hours
- Word-level Timestamps: Precise alignment for each word
## Use Cases
- Podcast transcription with speaker identification
- Meeting transcription with participant tracking
- Interview analysis with speaker separation
- Voice-based speaker matching across episodes
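Cross-episode speaker matching works by comparing the 256-dimensional embeddings returned for each speaker. A minimal sketch of that comparison using cosine similarity with NumPy; the `match_speaker` helper and the 0.75 threshold are illustrative assumptions, not part of the model's API:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two speaker embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_speaker(query_embedding, known_speakers, threshold=0.75):
    """Return (name, score) of the best-matching known speaker, or (None, threshold) if none qualify.

    known_speakers maps a speaker name to a 256-dimensional embedding
    from a previous episode. The 0.75 threshold is a hypothetical
    starting point; tune it on your own data.
    """
    best_name, best_score = None, threshold
    for name, emb in known_speakers.items():
        score = cosine_similarity(np.asarray(query_embedding, dtype=float),
                                  np.asarray(emb, dtype=float))
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score
```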
## Input Parameters
- `audio`: Audio file URL or upload
- `huggingface_token`: Required for speaker diarization (get one at huggingface.co)
- `enable_diarization`: Enable speaker separation (default: true)
- `min_speakers`: Minimum expected speakers (optional)
- `max_speakers`: Maximum expected speakers (optional)
- `batch_size`: Processing batch size (default: 8)
- `return_word_timestamps`: Include word-level timing (default: true)
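For orientation, here is a sketch of assembling these parameters into a request payload. The parameter names and defaults mirror the list above; the endpoint URL, the `requests`-based call, and the `HF_TOKEN` environment variable are placeholders for whatever hosting API and credential handling you actually use:

```python
import os
import requests  # placeholder HTTP client; substitute your hosting provider's SDK

# Input payload mirroring the parameters documented above.
payload = {
    "audio": "https://example.com/podcast-episode-01.mp3",  # URL or uploaded file
    "huggingface_token": os.environ["HF_TOKEN"],            # needed for diarization
    "enable_diarization": True,
    "min_speakers": 2,            # optional hints for the diarization pipeline
    "max_speakers": 4,
    "batch_size": 8,
    "return_word_timestamps": True,
}

# Hypothetical prediction endpoint; replace with your deployment's URL.
response = requests.post("https://api.example.com/predictions", json=payload, timeout=600)
response.raise_for_status()
result = response.json()
```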
## Output
Returns JSON with:
- `segments`: Transcribed text with timestamps and speaker labels
- `speaker_embeddings`: 256-dimensional voice embeddings for each speaker
- `metadata`: Language, duration, speaker count
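A short sketch of consuming the returned JSON, assuming `result` is the parsed response from the call above. The per-segment fields (`start`, `end`, `text`, `speaker`) follow WhisperX's usual segment layout, and the exact keys inside `speaker_embeddings` and `metadata` are assumptions to verify against a real response:

```python
# Print a speaker-labelled transcript from the segments list.
for seg in result["segments"]:
    speaker = seg.get("speaker", "UNKNOWN")
    print(f"[{seg['start']:.2f}-{seg['end']:.2f}] {speaker}: {seg['text']}")

# One 256-dimensional embedding per detected speaker (assumed keyed by speaker label).
for speaker, vector in result["speaker_embeddings"].items():
    print(speaker, "embedding length:", len(vector))

# Metadata: language, duration, speaker count (key names assumed).
meta = result["metadata"]
print(meta["language"], meta["duration"], meta["speaker_count"])
```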