twangodev/qwenasr | Readme and Docs

Speech recognition and forced alignment powered by Qwen3-ASR-1.7B and Qwen3-ForcedAligner-0.6B. Returns word-level timestamps for transcription or text-audio alignment.

Modes

Transcribe: Convert audio to text with precise word-level timestamps and automatic language detection.

Align: Align provided text to audio, producing start and end times for each word. Alignment supports Chinese, English, Cantonese, French, German, Italian, Japanese, Korean, Portuguese, Russian, and Spanish.

Output

Both modes return word-level timestamps:

{
  "text": "Hello world",
  "words": [
    {"text": "Hello", "start_time": 0.0, "end_time": 0.32},
    {"text": "world", "start_time": 0.34, "end_time": 0.72}
  ]
}
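
As one way to consume this output, the sketch below turns the `words` array into SRT-style caption cues. It assumes `start_time` and `end_time` are in seconds (the sample values suggest this, but the README does not say so), and the helper names `srt_timestamp` and `words_to_srt` are hypothetical, not part of the model's API. Words are joined with spaces, which suits English but would need adjusting for Chinese or Japanese.

```python
import json

# Sample payload copied from the output example above.
raw = """{
  "text": "Hello world",
  "words": [
    {"text": "Hello", "start_time": 0.0, "end_time": 0.32},
    {"text": "world", "start_time": 0.34, "end_time": 0.72}
  ]
}"""


def srt_timestamp(seconds):
    """Format a time in seconds (assumed unit) as HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words, max_words=7):
    """Group consecutive words into cues of at most max_words words each."""
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        start = srt_timestamp(chunk[0]["start_time"])
        end = srt_timestamp(chunk[-1]["end_time"])
        # Space-joined text; an assumption that fits space-delimited languages.
        text = " ".join(w["text"] for w in chunk)
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)


result = json.loads(raw)
srt = words_to_srt(result["words"])
print(srt)
```

The cue size (`max_words=7`) is an arbitrary choice for illustration; splitting on pauses between `end_time` and the next `start_time` would be a reasonable alternative.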

Self-hosting

See the GitHub repository for Docker and local deployment options.
