audioscrape/whisperx

WhisperX with speaker diarization and embeddings extraction for audio transcription


Run time and cost

This model runs on NVIDIA L40S GPU hardware. We don't yet have enough runs of this model to provide performance information.

Readme

WhisperX with Speaker Embeddings

Production-ready WhisperX model for high-quality audio transcription with speaker diarization and voice embedding extraction.

Features

  • Transcription: State-of-the-art WhisperX large-v3 model
  • Speaker Diarization: PyAnnote-based multi-speaker identification
  • Voice Embeddings: 256-dimensional speaker embeddings for voice matching
  • Long-form Support: Handles audio files up to 4 hours
  • Word-level Timestamps: Precise alignment for each word

Use Cases

  • Podcast transcription with speaker identification
  • Meeting transcription with participant tracking
  • Interview analysis with speaker separation
  • Voice-based speaker matching across episodes
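
The last use case relies on comparing the 256-dimensional voice embeddings this model returns. A minimal sketch of one common approach, cosine similarity with a decision threshold; the threshold value here is illustrative, not something this model documents:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two speaker embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def same_speaker(emb_a, emb_b, threshold=0.75):
    # threshold=0.75 is an assumed example value; tune it on your own data.
    return cosine_similarity(emb_a, emb_b) >= threshold
```

Comparing an episode's embeddings against a stored set of known-speaker embeddings lets you track the same voice across recordings.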

Input Parameters

  • audio: Audio file URL or upload
  • huggingface_token: Required for speaker diarization (get one at huggingface.co; the PyAnnote diarization models are gated, so you must also accept their terms there)
  • enable_diarization: Enable speaker separation (default: true)
  • min_speakers: Minimum expected speakers (optional)
  • max_speakers: Maximum expected speakers (optional)
  • batch_size: Processing batch size (default: 8)
  • return_word_timestamps: Include word-level timing (default: true)
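
A minimal sketch of assembling these parameters for Replicate's Python client; the audio URL and token are placeholders, and the speaker counts are example hints:

```python
# Hypothetical input payload matching the parameters listed above.
input_payload = {
    "audio": "https://example.com/episode.mp3",  # placeholder URL
    "huggingface_token": "hf_xxx",               # placeholder; required for diarization
    "enable_diarization": True,
    "min_speakers": 2,       # optional hint
    "max_speakers": 4,       # optional hint
    "batch_size": 8,         # default
    "return_word_timestamps": True,
}

# With the official client, the call would be roughly:
#   import replicate
#   output = replicate.run("audioscrape/whisperx", input=input_payload)
```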

Output

Returns JSON with:

  • segments: Transcribed text with timestamps and speaker labels
  • speaker_embeddings: 256-dimensional voice embeddings for each speaker
  • metadata: Language, duration, speaker count
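
A sketch of working with a result shaped like this description; the exact field names inside each segment (e.g. "start", "end", "speaker", "text") are assumptions based on typical WhisperX-style output, not confirmed by this README:

```python
# Sample result matching the documented top-level keys; segment field
# names are assumed for illustration.
result = {
    "segments": [
        {"start": 0.0, "end": 3.2, "speaker": "SPEAKER_00", "text": "Welcome back."},
        {"start": 3.2, "end": 7.8, "speaker": "SPEAKER_01", "text": "Thanks for having me."},
        {"start": 7.8, "end": 9.5, "speaker": "SPEAKER_00", "text": "Let's dive in."},
    ],
    "metadata": {"language": "en", "duration": 9.5, "speaker_count": 2},
}

def transcript_by_speaker(segments):
    # Collect each speaker's lines in order of appearance.
    by_speaker = {}
    for seg in segments:
        by_speaker.setdefault(seg["speaker"], []).append(seg["text"])
    return by_speaker

grouped = transcript_by_speaker(result["segments"])
# grouped["SPEAKER_00"] → ["Welcome back.", "Let's dive in."]
```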