lucataco/vibevoice-asr

Long-form speech recognition with speaker and timestamp segments using Microsoft's VibeVoice-ASR.


Run time and cost

This model runs on Nvidia L40S GPU hardware. We don't yet have enough runs of this model to provide performance information.

Readme

VibeVoice ASR

VibeVoice ASR is a long-form speech recognition model from Microsoft that transcribes spoken audio and returns structured speaker timestamp segments. It is useful for turning interviews, demos, podcasts, meetings, and generated speech samples into readable text with timing metadata.

This Replicate model uses microsoft/VibeVoice-ASR-HF, the Transformers-native checkpoint of the original microsoft/VibeVoice-ASR release.

What it returns

The model returns:

  • transcription: the full transcript as plain text.
  • segments: a list of timestamped speech segments with start time, end time, speaker id, and text.
  • raw_output: the raw model output for debugging or downstream parsing.
  • inference_time_seconds: time spent running inference after the model is loaded.
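As a sketch of how the structured output can be consumed, the snippet below converts a list of segments into SRT subtitles. The field names ("start", "end", "speaker", "text") follow the description above but are assumptions; inspect a real response for the exact keys your version of the model returns.

```python
def fmt_ts(seconds: float) -> str:
    """Format seconds as an SRT timestamp, e.g. 3.5 -> 00:00:03,500."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def segments_to_srt(segments: list[dict]) -> str:
    """Render model segments as an SRT string, one cue per segment.

    Assumes each segment dict carries start/end times in seconds plus a
    speaker label and text, per the output description above.
    """
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n"
            f"{fmt_ts(seg['start'])} --> {fmt_ts(seg['end'])}\n"
            f"[{seg['speaker']}] {seg['text']}\n"
        )
    return "\n".join(blocks)
```

Writing the returned string to a .srt file gives a subtitle track that keeps both the timing metadata and the speaker labels.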

Inputs

  • audio: The audio file to transcribe. WAV and common audio formats are supported.
  • prompt: Optional context or hotwords. Use this for names, product terms, acronyms, or topic hints that may improve recognition.
  • max_new_tokens: Maximum number of generated text tokens. The default works for short and medium clips; increase for longer audio.
  • tokenizer_chunk_size: Audio tokenizer chunk size in samples. The default is best for throughput. Use a smaller value such as 64000 when testing or when memory is constrained.
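A minimal sketch of calling the model with the official replicate Python client, using the parameters listed above. The helper that assembles the input dict is illustrative, as are the file name, prompt text, and token values; running it for real requires the replicate package and a REPLICATE_API_TOKEN.

```python
def build_input(audio, prompt=None, max_new_tokens=None, tokenizer_chunk_size=None):
    """Assemble the input payload, omitting optional parameters left unset
    so the model falls back to its defaults."""
    payload = {"audio": audio}
    if prompt is not None:
        payload["prompt"] = prompt
    if max_new_tokens is not None:
        payload["max_new_tokens"] = max_new_tokens
    if tokenizer_chunk_size is not None:
        payload["tokenizer_chunk_size"] = tokenizer_chunk_size
    return payload


if __name__ == "__main__":
    import replicate  # pip install replicate; needs REPLICATE_API_TOKEN set

    with open("meeting.wav", "rb") as f:  # illustrative file name
        output = replicate.run(
            "lucataco/vibevoice-asr",
            input=build_input(
                f,
                prompt="Acme Robotics, Dr. Okafor",  # hotwords (assumed example)
                tokenizer_chunk_size=64000,          # faster QA setting per the tips
            ),
        )
    print(output["transcription"])
```

Leaving max_new_tokens and tokenizer_chunk_size out of the payload keeps the documented defaults, which is the recommended starting point for production use.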

Good use cases

  • Transcribing speech demos and generated audio samples.
  • Creating timestamped transcripts for podcasts, meetings, interviews, and videos.
  • Extracting structured ASR segments for downstream editing or search.
  • Evaluating speech generation models by transcribing their outputs.
  • Building dataset preparation or QA workflows for long-form audio.

Tips

  • Add a short prompt when the audio includes unusual names, product names, or domain-specific vocabulary.
  • For short test clips, tokenizer_chunk_size of 64000 is a good fast QA setting.
  • For production workloads, start with the default settings unless you need to tune memory use.
  • Check segments when you need timestamps; use transcription when you only need plain text.

Limitations

  • ASR quality depends on audio clarity, background noise, accents, and speaker overlap.
  • Speaker ids are model-produced segment labels, not a full diarization pipeline with persistent speaker identity guarantees.
  • Very long audio may require higher token limits and may take longer to process.
  • The model may make mistakes on rare names, code-switching, music, or heavily distorted speech.

License

The wrapper is Apache-2.0. The model weights are governed by the upstream Microsoft VibeVoice-ASR license and model card.
