# WhisperX with Speaker Embeddings

A production-ready WhisperX model for high-quality audio transcription with speaker diarization and voice embedding extraction.
## Features
- Transcription: State-of-the-art WhisperX large-v3 model
- Speaker Diarization: PyAnnote-based multi-speaker identification
- Voice Embeddings: 256-dimensional speaker embeddings for voice matching
- Long-form Support: Handles audio files up to 4 hours
- Word-level Timestamps: Precise alignment for each word
## Use Cases
- Podcast transcription with speaker identification
- Meeting transcription with participant tracking
- Interview analysis with speaker separation
- Voice-based speaker matching across episodes
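Cross-episode speaker matching works by comparing the 256-dimensional embeddings returned for each speaker. A minimal sketch of that comparison using cosine similarity with NumPy; the `match_speaker` helper and the 0.75 threshold are illustrative assumptions, not part of the model's API:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two speaker embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_speaker(query_embedding, known_speakers, threshold=0.75):
    """Return (name, score) of the best-matching known speaker, or (None, threshold) if none qualify.

    known_speakers maps a speaker name to a 256-dimensional embedding
    from a previous episode. The 0.75 threshold is a hypothetical
    starting point; tune it on your own data.
    """
    best_name, best_score = None, threshold
    for name, emb in known_speakers.items():
        score = cosine_similarity(np.asarray(query_embedding, dtype=float),
                                  np.asarray(emb, dtype=float))
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score
```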
## Input Parameters
- `audio`: Audio file URL or upload
- `huggingface_token`: Required for speaker diarization (get one at huggingface.co)
- `enable_diarization`: Enable speaker separation (default: true)
- `min_speakers`: Minimum expected speakers (optional)
- `max_speakers`: Maximum expected speakers (optional)
- `batch_size`: Processing batch size (default: 8)
- `return_word_timestamps`: Include word-level timing (default: true)
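For orientation, here is a sketch of assembling these parameters into a request payload. The parameter names and defaults mirror the list above; the endpoint URL, the `requests`-based call, and the `HF_TOKEN` environment variable are placeholders for whatever hosting API and credential handling you actually use:

```python
import os
import requests  # placeholder HTTP client; substitute your hosting provider's SDK

# Input payload mirroring the parameters documented above.
payload = {
    "audio": "https://example.com/podcast-episode-01.mp3",  # URL or uploaded file
    "huggingface_token": os.environ["HF_TOKEN"],            # needed for diarization
    "enable_diarization": True,
    "min_speakers": 2,            # optional hints for the diarization pipeline
    "max_speakers": 4,
    "batch_size": 8,
    "return_word_timestamps": True,
}

# Hypothetical prediction endpoint; replace with your deployment's URL.
response = requests.post("https://api.example.com/predictions", json=payload, timeout=600)
response.raise_for_status()
result = response.json()
```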
## Output
Returns JSON with:
- `segments`: Transcribed text with timestamps and speaker labels
- `speaker_embeddings`: 256-dimensional voice embeddings for each speaker
- `metadata`: Language, duration, speaker count
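A short sketch of consuming the returned JSON, assuming `result` is the parsed response from the call above. The per-segment fields (`start`, `end`, `text`, `speaker`) follow WhisperX's usual segment layout, and the exact keys inside `speaker_embeddings` and `metadata` are assumptions to verify against a real response:

```python
# Print a speaker-labelled transcript from the segments list.
for seg in result["segments"]:
    speaker = seg.get("speaker", "UNKNOWN")
    print(f"[{seg['start']:.2f}-{seg['end']:.2f}] {speaker}: {seg['text']}")

# One 256-dimensional embedding per detected speaker (assumed keyed by speaker label).
for speaker, vector in result["speaker_embeddings"].items():
    print(speaker, "embedding length:", len(vector))

# Metadata: language, duration, speaker count (key names assumed).
meta = result["metadata"]
print(meta["language"], meta["duration"], meta["speaker_count"])
```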