twangodev/qwenasr

Serve QwenASR speech recognition and alignment.

Public
1.3K runs

Run time and cost

This model costs approximately $0.0013 to run on Replicate, or about 769 runs per $1, though this varies depending on your inputs. It is also open source, so you can run it on your own computer with Docker.

This model runs on Nvidia T4 GPU hardware. Predictions typically complete within 6 seconds.

Readme

Speech recognition and forced alignment powered by Qwen3-ASR-1.7B and Qwen3-ForcedAligner-0.6B. Returns word-level timestamps for transcription or text-audio alignment.

Modes

Transcribe: Convert audio to text with precise word-level timestamps and automatic language detection.

Align: Align provided text to audio, producing exact start/end times for each word. Supports Chinese, English, Cantonese, French, German, Italian, Japanese, Korean, Portuguese, Russian, and Spanish.
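The two modes above take slightly different inputs: transcribe needs only audio, while align also needs the text to align. A minimal sketch of assembling a prediction input for the Replicate Python client follows; the field names (`audio`, `task`, `text`) are assumptions for illustration — check this model's API schema on Replicate for the actual names.

```python
from typing import Optional

def build_input(audio_url: str, mode: str = "transcribe",
                text: Optional[str] = None) -> dict:
    """Assemble a prediction input dict (field names are assumed, not confirmed)."""
    if mode not in ("transcribe", "align"):
        raise ValueError(f"unknown mode: {mode}")
    if mode == "align" and not text:
        # Align mode has nothing to do without the text to align.
        raise ValueError("align mode requires the text to align")
    payload = {"audio": audio_url, "task": mode}
    if text is not None:
        payload["text"] = text
    return payload

# With a payload built, the prediction call would look like:
#   import replicate
#   output = replicate.run("twangodev/qwenasr", input=build_input("https://example.com/clip.wav"))
```

The helper only validates mode/text consistency locally; the actual prediction requires a Replicate API token and a network call.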

Output

Both modes return word-level timestamps:

{
  "text": "Hello world",
  "words": [
    {"text": "Hello", "start_time": 0.0, "end_time": 0.32},
    {"text": "world", "start_time": 0.34, "end_time": 0.72}
  ]
}
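Word-level timestamps like these are straightforward to post-process. As one illustration (the grouping logic is not part of the model), this sketch splits the `words` list into caption-style segments wherever the gap between consecutive words exceeds a pause threshold, using the documented output shape:

```python
def group_words(words, max_gap=0.5):
    """Split a word list into segments at pauses longer than max_gap seconds."""
    segments = []
    current = []
    for word in words:
        # Start a new segment when the silence before this word is long enough.
        if current and word["start_time"] - current[-1]["end_time"] > max_gap:
            segments.append(current)
            current = []
        current.append(word)
    if current:
        segments.append(current)
    return segments

# Example output in the shape documented above.
output = {
    "text": "Hello world",
    "words": [
        {"text": "Hello", "start_time": 0.0, "end_time": 0.32},
        {"text": "world", "start_time": 0.34, "end_time": 0.72},
    ],
}
for seg in group_words(output["words"]):
    start, end = seg[0]["start_time"], seg[-1]["end_time"]
    print(f'{start:.2f}-{end:.2f}s: {" ".join(w["text"] for w in seg)}')
# → 0.00-0.72s: Hello world
```

Here the 0.02 s gap between the two words is below the threshold, so they land in one segment; a longer pause would start a new one.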

Self-hosting

See the GitHub repository for Docker and local deployment options.
