audioscrape/whisperx
WhisperX with speaker diarization and embeddings extraction for audio transcription
Run audioscrape/whisperx with an API
Use one of our client libraries to get started quickly. Clicking on a library will take you to the Playground tab where you can tweak different inputs, see the results, and copy the corresponding code to use in your own project.
Input schema
The fields you can use to run this model with an API. If you don't provide a value for a field, its default value is used.
Field | Type | Default value | Description |
---|---|---|---|
audio | string | | Audio file (supports up to 4 hours) |
min_speakers | integer (Min: 1, Max: 20) | | Minimum number of speakers (None = auto-detect) |
max_speakers | integer (Min: 1, Max: 20) | | Maximum number of speakers (None = auto-detect) |
language | string | | Language code (e.g., 'en'). Leave empty for auto-detect |
huggingface_token | string | | HuggingFace token for speaker diarization (required) |
batch_size | integer (Min: 1, Max: 32) | 8 | Batch size for transcription (lower for long audio) |
enable_diarization | boolean | True | Enable speaker diarization |
return_word_timestamps | boolean | True | Return word-level timestamps |
{
"type": "object",
"title": "Input",
"required": [
"audio",
"huggingface_token"
],
"properties": {
"audio": {
"type": "string",
"title": "Audio",
"format": "uri",
"x-order": 0,
"description": "Audio file (supports up to 4 hours)"
},
"language": {
"type": "string",
"title": "Language",
"x-order": 3,
"nullable": true,
"description": "Language code (e.g., 'en'). Leave empty for auto-detect"
},
"batch_size": {
"type": "integer",
"title": "Batch Size",
"default": 8,
"maximum": 32,
"minimum": 1,
"x-order": 5,
"description": "Batch size for transcription (lower for long audio)"
},
"max_speakers": {
"type": "integer",
"title": "Max Speakers",
"maximum": 20,
"minimum": 1,
"x-order": 2,
"nullable": true,
"description": "Maximum number of speakers (None = auto-detect)"
},
"min_speakers": {
"type": "integer",
"title": "Min Speakers",
"maximum": 20,
"minimum": 1,
"x-order": 1,
"nullable": true,
"description": "Minimum number of speakers (None = auto-detect)"
},
"huggingface_token": {
"type": "string",
"title": "Huggingface Token",
"x-order": 4,
"description": "HuggingFace token for speaker diarization (required)"
},
"enable_diarization": {
"type": "boolean",
"title": "Enable Diarization",
"default": true,
"x-order": 6,
"description": "Enable speaker diarization"
},
"return_word_timestamps": {
"type": "boolean",
"title": "Return Word Timestamps",
"default": true,
"x-order": 7,
"description": "Return word-level timestamps"
}
}
}
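The schema above can be turned into a small client-side check before a request is sent. The following is a minimal sketch, not part of the model's own tooling: `build_input` is a hypothetical helper, the URL and token are placeholders, and the `min_speakers <= max_speakers` check is an added assumption that the schema itself does not state.

```python
from typing import Any, Dict, Optional


def build_input(
    audio: str,
    huggingface_token: str,
    min_speakers: Optional[int] = None,
    max_speakers: Optional[int] = None,
    language: Optional[str] = None,
    batch_size: int = 8,
    enable_diarization: bool = True,
    return_word_timestamps: bool = True,
) -> Dict[str, Any]:
    """Build an input payload that respects the schema's required fields and ranges."""
    # "audio" and "huggingface_token" are the two required fields.
    if not audio or not huggingface_token:
        raise ValueError("'audio' and 'huggingface_token' are required")
    # batch_size: minimum 1, maximum 32.
    if not 1 <= batch_size <= 32:
        raise ValueError("batch_size must be between 1 and 32")
    # min_speakers / max_speakers: minimum 1, maximum 20, nullable.
    for name, value in (("min_speakers", min_speakers), ("max_speakers", max_speakers)):
        if value is not None and not 1 <= value <= 20:
            raise ValueError(f"{name} must be between 1 and 20")
    # Assumption (not in the schema): the minimum should not exceed the maximum.
    if min_speakers is not None and max_speakers is not None and min_speakers > max_speakers:
        raise ValueError("min_speakers must not exceed max_speakers")

    payload: Dict[str, Any] = {
        "audio": audio,
        "huggingface_token": huggingface_token,
        "batch_size": batch_size,
        "enable_diarization": enable_diarization,
        "return_word_timestamps": return_word_timestamps,
    }
    # Nullable fields are left out when unset so the model can auto-detect them.
    optional = {"min_speakers": min_speakers, "max_speakers": max_speakers, "language": language}
    payload.update({key: value for key, value in optional.items() if value is not None})
    return payload


payload = build_input(
    audio="https://example.com/meeting.mp3",  # placeholder URL
    huggingface_token="hf_xxx",               # placeholder token
    min_speakers=2,
    max_speakers=4,
)
```

The resulting dictionary is what would be passed as the `input` argument of a client library call, for example `replicate.run(...)` in the Python client; the exact model reference string to use there is not given on this page.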
Output schema
The shape of the response you’ll get when you run this model with an API.
{
"type": "object",
"title": "Output",
"additionalProperties": true
}
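Because the output schema only promises an object with `additionalProperties: true`, no field names are guaranteed. One cautious approach is to inspect the response at runtime before relying on any key. The sketch below assumes the response has already been decoded into a Python dictionary; `describe_output` is a hypothetical helper, and the keys in `sample` are invented for illustration and may not match the real output.

```python
from typing import Any, Dict


def describe_output(output: Dict[str, Any]) -> Dict[str, str]:
    """Map each top-level key of the response to the type name of its value."""
    if not isinstance(output, dict):
        raise TypeError(f"expected an object, got {type(output).__name__}")
    return {key: type(value).__name__ for key, value in output.items()}


# Hypothetical response used only for illustration; real keys may differ.
sample = {"segments": [], "language": "en"}
summary = describe_output(sample)
print(summary)
```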