Readme
Speech recognition and forced alignment powered by Qwen3-ASR-1.7B and Qwen3-ForcedAligner-0.6B. Returns word-level timestamps for transcription or text-audio alignment.
Modes
Transcribe: Convert audio to text with word-level timestamps and automatic language detection.
Align: Align provided text to audio, producing start and end times for each word. Supports Chinese, English, Cantonese, French, German, Italian, Japanese, Korean, Portuguese, Russian, and Spanish.
Output
Both modes return word-level timestamps:
{
  "text": "Hello world",
  "words": [
    {"text": "Hello", "start_time": 0.0, "end_time": 0.32},
    {"text": "world", "start_time": 0.34, "end_time": 0.72}
  ]
}
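As an illustration of consuming this output, the following is a minimal Python sketch that turns the "words" list into one SRT caption cue per word. It assumes that start_time and end_time are in seconds, which the example values suggest but this README does not state. The helper names format_srt_time and words_to_srt are hypothetical and are not part of the model's API.

```python
import json

# Sample payload copied from the output example above.
SAMPLE = """{
  "text": "Hello world",
  "words": [
    {"text": "Hello", "start_time": 0.0, "end_time": 0.32},
    {"text": "world", "start_time": 0.34, "end_time": 0.72}
  ]
}"""


def format_srt_time(seconds: float) -> str:
    """Format a time as an SRT timestamp (HH:MM:SS,mmm).

    Assumption: the input is in seconds.
    """
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(result: dict) -> str:
    """Build one SRT cue per entry in result["words"]."""
    cues = []
    for index, word in enumerate(result["words"], start=1):
        start = format_srt_time(word["start_time"])
        end = format_srt_time(word["end_time"])
        cues.append(f"{index}\n{start} --> {end}\n{word['text']}\n")
    # Cues are separated by a blank line, as SRT expects.
    return "\n".join(cues)


result = json.loads(SAMPLE)
srt = words_to_srt(result)
print(srt)
```

Per-word cues are the simplest mapping from this structure. For readable subtitles, consecutive words would usually be grouped into longer cues, for example by splitting wherever the gap between one word's end_time and the next word's start_time exceeds a chosen threshold.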
Self-hosting
See the GitHub repository for Docker and local deployment options.