audioscrape/whisperx

WhisperX audio transcription with speaker diarization and speaker-embedding extraction

WhisperX with Speaker Embeddings

Production-ready WhisperX model for high-quality audio transcription with speaker diarization and voice-embedding extraction.

Features

  • Transcription: State-of-the-art WhisperX large-v3 model
  • Speaker Diarization: PyAnnote-based multi-speaker identification
  • Voice Embeddings: 256-dimensional speaker embeddings for voice matching
  • Long-form Support: Handles audio files up to 4 hours
  • Word-level Timestamps: Precise alignment for each word

Use Cases

  • Podcast transcription with speaker identification
  • Meeting transcription with participant tracking
  • Interview analysis with speaker separation
  • Voice-based speaker matching across episodes
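
The cross-episode speaker matching above can be done by comparing the 256-dimensional voice embeddings the model returns, for example with cosine similarity. A minimal sketch (the `match_speaker` helper and the 0.75 threshold are illustrative assumptions, not part of the model's API):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def match_speaker(query_embedding, known_speakers, threshold=0.75):
    """Return the best-matching known speaker label, or None.

    known_speakers maps label -> embedding; the 0.75 threshold is an
    assumption to tune on your own data, not a documented value.
    """
    best_label, best_score = None, threshold
    for label, emb in known_speakers.items():
        score = cosine_similarity(query_embedding, emb)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

In practice you would store each episode's `speaker_embeddings` and match new speakers against that library.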

Input Parameters

  • audio: Audio file URL or upload
  • huggingface_token: Required for speaker diarization (get one at huggingface.co)
  • enable_diarization: Enable speaker separation (default: true)
  • min_speakers: Minimum expected speakers (optional)
  • max_speakers: Maximum expected speakers (optional)
  • batch_size: Processing batch size (default: 8)
  • return_word_timestamps: Include word-level timing (default: true)
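
A sketch of invoking the model with these parameters via the Replicate Python client (the audio URL is a placeholder; `build_whisperx_input` is a hypothetical helper, and the payload keys mirror the parameter list above):

```python
def build_whisperx_input(audio_url, hf_token, min_speakers=None, max_speakers=None):
    """Assemble the input payload using the parameter names documented above."""
    payload = {
        "audio": audio_url,
        "huggingface_token": hf_token,   # required for diarization
        "enable_diarization": True,
        "batch_size": 8,                 # documented default
        "return_word_timestamps": True,
    }
    # min/max speakers are optional hints for the diarizer
    if min_speakers is not None:
        payload["min_speakers"] = min_speakers
    if max_speakers is not None:
        payload["max_speakers"] = max_speakers
    return payload

if __name__ == "__main__":
    # Requires `pip install replicate` and REPLICATE_API_TOKEN in the environment.
    import os
    import replicate

    output = replicate.run(
        "audioscrape/whisperx",
        input=build_whisperx_input(
            "https://example.com/episode.mp3",     # placeholder URL
            os.environ["HUGGINGFACE_TOKEN"],
        ),
    )
    print(output["metadata"])
```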

Output

Returns JSON with:

  • segments: Transcribed text with timestamps and speaker labels
  • speaker_embeddings: 256-dimensional voice embeddings for each speaker
  • metadata: Language, duration, and speaker count
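
The output can be turned into a readable, speaker-attributed transcript. A sketch, assuming each segment carries `start`, `end`, `text`, and `speaker` fields (the exact field names are an assumption based on the summary above):

```python
def format_transcript(result):
    """Render segments as 'SPEAKER [start-end]: text' lines.

    Field names ('start', 'end', 'text', 'speaker') are assumed from the
    output summary; adjust if the actual JSON differs.
    """
    lines = []
    for seg in result["segments"]:
        speaker = seg.get("speaker", "UNKNOWN")
        lines.append(
            f"{speaker} [{seg['start']:.1f}-{seg['end']:.1f}]: {seg['text'].strip()}"
        )
    return "\n".join(lines)
```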