twangodev/qwenasr

Serve QwenASR speech recognition and alignment.


Run time and cost

This model costs approximately $0.0015 per run on Replicate, or about 666 runs per $1, though the actual cost varies with your inputs. It is also open source, so you can run it on your own computer with Docker.

This model runs on Nvidia T4 GPU hardware. Predictions typically complete within 7 seconds.

Readme

Speech recognition and forced alignment powered by Qwen3-ASR-1.7B and Qwen3-ForcedAligner-0.6B. Returns word-level timestamps for transcription or text-audio alignment.

Modes

Transcribe: Convert audio to text with precise word-level timestamps and automatic language detection.

Align: Align provided text to audio, producing exact start/end times for each word. Supports Chinese, English, Cantonese, French, German, Italian, Japanese, Korean, Portuguese, Russian, and Spanish.

Output

Both modes return word-level timestamps:

{
  "text": "Hello world",
  "words": [
    {"text": "Hello", "start_time": 0.0, "end_time": 0.32},
    {"text": "world", "start_time": 0.34, "end_time": 0.72}
  ]
}
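The timestamps are easy to post-process locally. As a sketch (using only the fields shown in the example above, with a hypothetical `to_caption` helper), you can compute per-word durations or render a simple caption line:

```python
# Sample output in the shape documented above.
result = {
    "text": "Hello world",
    "words": [
        {"text": "Hello", "start_time": 0.0, "end_time": 0.32},
        {"text": "world", "start_time": 0.34, "end_time": 0.72},
    ],
}

def word_durations(result):
    """Map each word to its duration in seconds."""
    return {
        w["text"]: round(w["end_time"] - w["start_time"], 2)
        for w in result["words"]
    }

def to_caption(result):
    """Render one caption line spanning the first to the last word."""
    words = result["words"]
    start = words[0]["start_time"]
    end = words[-1]["end_time"]
    return f"[{start:.2f} -> {end:.2f}] {result['text']}"

print(word_durations(result))
print(to_caption(result))
```

The same loop generalizes to building SRT/VTT cues or karaoke-style highlighting from either mode's output.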

Self-hosting

See the GitHub repository for Docker and local deployment options.
