This runs preprocessing code to generate a dataset you can use to fine-tune Whisper. Specifically, it takes as input either:
- Two tarballs: one of audio files and one of text files. The transcription for a given audio file must share its base name (e.g., audio1.mp3 corresponds to audio1.txt); see the first sketch after this list.
OR
- A JSONL file (named <some_file.txt>) containing lines of the form below; see the second sketch after this list:
  ...
  {"audio": <URL of audio file>, "sentence": <URL of transcription>}
  {"audio": <URL of audio file>, "sentence": <URL of transcription>}
  ...
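
To make the base-name convention concrete, here is a minimal sketch of how the two tarballs might be paired up. The helper name, the file paths, and the matching logic are illustrative assumptions, not the actual preprocessing code:

```python
import os
import tarfile


def pair_tarball_members(audio_tar_path: str, text_tar_path: str) -> dict:
    """Map each shared base name to its (audio member, transcript member) pair.

    Hypothetical helper: illustrates the audio1.mp3 <-> audio1.txt convention.
    """
    with tarfile.open(audio_tar_path) as audio_tar, tarfile.open(text_tar_path) as text_tar:
        audio = {os.path.splitext(os.path.basename(m.name))[0]: m.name
                 for m in audio_tar.getmembers() if m.isfile()}
        text = {os.path.splitext(os.path.basename(m.name))[0]: m.name
                for m in text_tar.getmembers() if m.isfile()}
    # Keep only base names present in both tarballs; unmatched files are dropped.
    return {base: (audio[base], text[base]) for base in audio.keys() & text.keys()}
```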
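
For the JSONL route, this is a small sketch of writing records in the expected shape. Only the "audio" and "sentence" keys come from the format above; the example URLs are placeholders, and the output name reuses the <some_file.txt> placeholder:

```python
import json

# Placeholder records: the real values would be URLs to your own files.
records = [
    {"audio": "https://example.com/audio1.mp3",
     "sentence": "https://example.com/audio1.txt"},
    {"audio": "https://example.com/audio2.mp3",
     "sentence": "https://example.com/audio2.txt"},
]

# One JSON object per line, which is what a JSONL input expects.
with open("some_file.txt", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```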