Forced Audio-Text Alignment

This model produces word-level timings from an audio file and its transcript. Given both, it returns a start and end time for each word in the transcript.

Example Output

[
    {
        "word": "The",
        "start": 0.0,
        "end": 0.16
    },
    {
        "word": "whole",
        "start": 0.16,
        "end": 0.32
    },
    {
        "word": "city",
        "start": 0.32,
        "end": 0.64
    }
]
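
A result in this shape is easy to sanity-check before it is passed on to other code. The following is a minimal sketch, not part of the model itself: the helper names `check_alignment` and `word_durations` are hypothetical, and it assumes only the `word`, `start`, and `end` fields shown in the example output. It uses the Python standard library alone.

```python
import json

# Sample copied from the example output above.
ALIGNMENT_JSON = """
[
    {"word": "The", "start": 0.0, "end": 0.16},
    {"word": "whole", "start": 0.16, "end": 0.32},
    {"word": "city", "start": 0.32, "end": 0.64}
]
"""


def check_alignment(words):
    """Return a list of problems found in a word-timing list (empty if none).

    Hypothetical helper: flags words that end before they start and
    words that begin before the previous word has ended.
    """
    problems = []
    previous_end = 0.0
    for i, entry in enumerate(words):
        if entry["end"] < entry["start"]:
            problems.append(f"word {i} ({entry['word']!r}) ends before it starts")
        if entry["start"] < previous_end:
            problems.append(f"word {i} ({entry['word']!r}) overlaps the previous word")
        previous_end = entry["end"]
    return problems


def word_durations(words):
    """Return (word, duration) pairs, rounded to hide float noise."""
    return [(e["word"], round(e["end"] - e["start"], 3)) for e in words]


words = json.loads(ALIGNMENT_JSON)
problems = check_alignment(words)
durations = word_durations(words)

print(problems)   # empty when the timings are ordered and non-overlapping
print(durations)
```

Whether touching boundaries (one word ending exactly where the next starts, as in the sample) should count as valid is a design choice; this sketch accepts them and rejects only true overlaps.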

Built using torchaudio’s MMS model. Supports various audio formats and includes fallback mechanisms for robust production use.