Forced Audio-Text Alignment
This model performs forced alignment: given an audio file and its transcript, it returns word-level timings, with a start and end time for each word in the transcript.
Example Output
[
  {
    "word": "The",
    "start": 0.0,
    "end": 0.16
  },
  {
    "word": "whole",
    "start": 0.16,
    "end": 0.32
  },
  {
    "word": "city",
    "start": 0.32,
    "end": 0.64
  }
]
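As a rough illustration of how this output might be consumed, the sketch below parses the example JSON, checks that the timings are ordered, and looks up the word spoken at a given time. The helper names `load_timings` and `word_at` are hypothetical and not part of this model. The sketch also assumes that the times are in seconds and that words never overlap, neither of which is stated above.

```python
import json

# Sample output copied from the example above.
raw = """
[
  {"word": "The", "start": 0.0, "end": 0.16},
  {"word": "whole", "start": 0.16, "end": 0.32},
  {"word": "city", "start": 0.32, "end": 0.64}
]
"""

def load_timings(text):
    """Parse the JSON and check each word's span (hypothetical helper).

    Assumption: words are listed in order and do not overlap.
    """
    words = json.loads(text)
    previous_end = 0.0
    for entry in words:
        if entry["end"] < entry["start"]:
            raise ValueError(f"negative duration for {entry['word']!r}")
        if entry["start"] < previous_end:
            raise ValueError(f"{entry['word']!r} overlaps the previous word")
        previous_end = entry["end"]
    return words

def word_at(words, t):
    """Return the word whose span contains time t, or None (hypothetical helper)."""
    for entry in words:
        if entry["start"] <= t < entry["end"]:
            return entry["word"]
    return None

words = load_timings(raw)
durations = {w["word"]: round(w["end"] - w["start"], 2) for w in words}
print(durations)
print(word_at(words, 0.2))
```

A lookup like `word_at` is the kind of building block a caption highlighter or karaoke-style display would need; for long transcripts a binary search over the start times would be a better fit than the linear scan used here.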
The model is built on torchaudio’s MMS model. It supports various audio formats and includes fallback mechanisms for robust production use.
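The fallback mechanisms are not described here, so the following is only a sketch of what one such fallback could look like, not this project's implementation. It assumes a hypothetical `align` callable standing in for the real aligner and, if that call fails, estimates timings by spreading the words across the audio duration in proportion to their character counts. The names `uniform_fallback` and `align_with_fallback` are invented for illustration.

```python
def uniform_fallback(transcript, duration):
    """Estimate timings by weighting each word by its character count.

    Hypothetical fallback: the result is an approximation, not an alignment.
    """
    words = transcript.split()
    if not words or duration <= 0:
        return []
    total_chars = sum(len(w) for w in words)
    timings = []
    cursor = 0.0
    for w in words:
        span = duration * len(w) / total_chars
        timings.append({
            "word": w,
            "start": round(cursor, 3),
            "end": round(cursor + span, 3),
        })
        cursor += span
    return timings

def align_with_fallback(align, transcript, duration):
    """Try the primary aligner; if it raises, return the estimate instead.

    `align` is an assumed callable taking the transcript and returning
    a list of {"word", "start", "end"} dictionaries.
    """
    try:
        return align(transcript)
    except Exception:
        return uniform_fallback(transcript, duration)

def failing_aligner(transcript):
    # Stand-in for an aligner that cannot process the input.
    raise RuntimeError("alignment failed")

result = align_with_fallback(failing_aligner, "The whole city", 1.2)
for entry in result:
    print(entry)
```

The estimate keeps the same output shape as a real alignment, so downstream code does not need a separate code path. A production system would likely also flag such results as approximate.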