daanelson / whisper-tune

A fine-tuneable version of whisper


Model Description

This is a fine-tuneable version of Whisper, useful for applications in under-resourced languages or with domain-specific audio. It currently supports tiny (useful for testing), small, medium, and large-v2.

Fine-tuning

If you have access to the training beta, you can fine-tune this model.

Here’s an example using replicate-python:

import replicate

training = replicate.trainings.create(
    version="daanelson/whisper-tune:a6ca247c2d2d26c6adb7da7694d63cea1702ecb7c3f84ce174e449416f40b2f9",
    input={
        "train_data": "https://storage.googleapis.com/dan-scratch-public/whisper_test/en_training_data.zip",
    },
    destination="my-username/my-model"
)
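
Training runs asynchronously. Here’s a rough sketch of polling for completion with the same client; the terminal status strings are assumptions based on Replicate’s standard statuses.

import time

# Poll the training until it reaches a terminal state.
while training.status not in {"succeeded", "failed", "canceled"}:
    time.sleep(60)
    training = replicate.trainings.get(training.id)

print(training.status)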

Training takes these input parameters:

  • train_data (required): URL to a zip file containing the training dataset. The dataset should be a zipped HuggingFace Dataset in which each row contains two fields: audio, the preprocessed audio resampled to 16kHz, and sentence, the text transcription of that audio. See this processing script for an example of how to parse a dataset into this format, and the sketch after this parameter list for a minimal illustration.

NOTE: we do not currently support passing a jsonl dataset with a list of audio & transcription URLs, but if you’d like that, reach out in Discord or email team@ and we’d be happy to add it.

  • eval_data (optional): URL to a zip file containing an evaluation dataset, in the same format as train_data. If you don’t provide one, no evaluation is run.
  • model_name (optional, default='small'): Name of the Whisper model to fine-tune. Options are [tiny, small, medium, large-v2].
  • whisper_language (optional): Language of the audio data, if the dataset is monolingual.
  • per_device_train_batch_size (optional, default=16): Training batch size per device.
  • gradient_accumulation_steps (optional, default=1): Number of training steps (each of per_device_train_batch_size examples) to accumulate gradients over before performing an optimizer step. gradient_accumulation_steps * per_device_train_batch_size = total effective batch size.
  • learning_rate (optional, default=2e-5): Learning rate.
  • num_train_epochs (optional, default=1): Number of epochs (iterations over the entire training dataset) to train for.
  • warmup_ratio (optional, default=0.03): Fraction of all training steps used for a linear learning rate warmup.
  • logging_steps (optional, default=1): Logs loss and other training info every logging_steps steps.
  • max_steps (optional, default=-1): Maximum number of training steps; unlimited if max_steps=-1. Overrides num_train_epochs; useful for testing.
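
As referenced in the train_data description above, here is a minimal sketch of preparing a dataset in the expected format, assuming the HuggingFace datasets library. The file paths and transcriptions are placeholders; treat this as an illustration rather than the exact script linked above.

from datasets import Dataset, Audio

# Placeholder audio files and their transcriptions.
ds = Dataset.from_dict({
    "audio": ["clips/sample_0001.wav", "clips/sample_0002.wav"],
    "sentence": ["first transcription", "second transcription"],
})

# Cast the audio column so each row decodes as audio resampled to 16kHz.
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

# Save the dataset to a directory.
ds.save_to_disk("en_training_data")

After saving, zip the directory (e.g. zip -r en_training_data.zip en_training_data), host it at a publicly accessible URL, and pass that URL as train_data.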

This training code is implemented with the HuggingFace API; see GitHub for more details.

Inference

The fine-tuned model runs inference on input audio as expected. Without fine-tuning, this model defaults to whisper-small, but it is not optimized for plain inference; if you’re just interested in running inference with Whisper, check out the standard whisper model on Replicate.
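
Once training finishes, you can run the fine-tuned model like any other Replicate model. A minimal sketch with replicate-python is below; the destination name, the <version> placeholder, and the audio input name are illustrative assumptions, not confirmed parameters.

import replicate

# Run the fine-tuned model pushed to the training destination.
output = replicate.run(
    "my-username/my-model:<version>",
    input={"audio": open("sample.wav", "rb")},
)
print(output)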