Fine-tune MusicGen if you want to generate music in a certain style, whether that's 16-bit video game chiptunes or the calm of something choral.
A full training run takes 15 minutes using 8x A40 (Large) hardware. You can run your fine-tuned model from the web or via the cloud API, or you can download the fine-tuned model weights for use in other contexts.
The fine-tuning process was developed by Jongmin Jung (aka Sake). It's based on Meta's AudioCraft and its built-in trainer, Dora. To keep training simple, Sake has included automatic audio chunking, auto-labeling, and vocal removal. Your trained model can also generate music longer than 30 seconds.
Here is an example of a choral fine-tune combined with a 16-bit video game fine-tune (as a continuation):
Just a few tracks (9-10) are enough to fine-tune MusicGen on a musical style.
Each of your tracks must be longer than 30 seconds. The training script automatically splits long audio files into 30-second chunks for training.
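If you want to sanity-check your dataset before uploading, a small script can confirm every track clears the 30-second minimum. This is just a sketch: it uses `mutagen` (which is not part of the fine-tuner) and assumes your tracks are MP3s in a hypothetical local `dataset/` folder.

```python
from pathlib import Path

from mutagen.mp3 import MP3  # pip install mutagen

DATASET_DIR = Path("dataset")  # hypothetical local folder of training tracks
MIN_SECONDS = 30

for track in sorted(DATASET_DIR.glob("*.mp3")):
    duration = MP3(track).info.length  # duration in seconds
    status = "ok" if duration > MIN_SECONDS else "TOO SHORT"
    print(f"{track.name}: {duration:.1f}s [{status}]")
```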
You can label your music in three ways:
- Automatically: labels are generated with `essentia` and used for each audio file (this is the default).
- With one description for every track, passed in via the `one_same_description` training parameter. This works alongside automatic labeling and will be added at the beginning.
- With your own descriptions: include a text file with the same filename as each track. For example, `01_A_Man_Without_Love.mp3` would need a text file named `01_A_Man_Without_Love.txt`. Inside each text file put a single-line description (there's a sketch after this list).
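When you write your own descriptions, the pairing is purely by filename. Here's a minimal sketch that writes a one-line `.txt` file next to each track in a hypothetical local `dataset/` folder; the second filename and the descriptions themselves are placeholders.

```python
from pathlib import Path

DATASET_DIR = Path("dataset")  # hypothetical local folder of training tracks

# Placeholder hand-written descriptions, keyed by track filename
descriptions = {
    "01_A_Man_Without_Love.mp3": "sacred chamber choir, choral",
    "02_Another_Track.mp3": "sacred chamber choir, choral, slow tempo",
}

for filename, text in descriptions.items():
    track = DATASET_DIR / filename
    # Each .txt sits next to its .mp3 and contains a single line
    track.with_suffix(".txt").write_text(text + "\n")
```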
MusicGen does not work well with music containing vocals, and the base models contain no vocals at all. Training on tracks with singing or speech leads to weird-sounding outputs, so by default we strip all vocals from your audio files.
If your tracks do not contain vocals, or if you want to try training with vocals anyway, you can disable this feature by setting `drop_vocals` to `false` in your training parameters (see below).
You can train either the `small`, `medium`, or `melody` models. Small is the default. The large model cannot be trained.
If you choose the small or medium models, you can generate music longer than 30 seconds using automatic continuation. The melody model is limited to 30 seconds.
The melody model lets you generate music based on the melody of an input. This feature is only available in your fine-tune if you’ve chosen to train the melody base model.
Before starting the training job you need to grab your Replicate API token from replicate.com/account/api-tokens.
In your shell, store that token in an environment variable called `REPLICATE_API_TOKEN`.
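The Replicate Python client picks the token up from that environment variable automatically, so a quick sanity check before training might look like this sketch:

```python
import os

# The replicate client reads the token from this environment variable
token = os.environ.get("REPLICATE_API_TOKEN")
if not token:
    raise SystemExit("REPLICATE_API_TOKEN is not set; export it in your shell first")
print("Token found, ready to start training")
```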
You also need to create a model on Replicate that will be the destination for the trained MusicGen version. Go to replicate.com/create to create the model.
In the example below we call it `my-name/my-model`.
Put your tracks (and any text files) in a folder and zip them up.
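If you'd rather script that step, Python's `shutil.make_archive` does the zipping (assuming the same hypothetical `dataset/` folder as above):

```python
import shutil

# Zip the dataset/ folder (tracks plus any .txt descriptions) into data.zip
shutil.make_archive("data", "zip", root_dir="dataset")
print("Wrote data.zip")
```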
If you’re using the Replicate CLI, you can upload your training files as part of your training command (see below).
Otherwise you’ll need to upload your zip file somewhere on the internet that is publicly accessible, like an S3 bucket or a GitHub Pages site.
To train MusicGen, run the following command from Python:
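The block below is a sketch using the Replicate Python client. The trainer identifier is an assumption: look up the current MusicGen fine-tuner on Replicate and substitute its latest version ID for `VERSION_ID`.

```python
import replicate

# Assumed trainer model; check Replicate for the current fine-tuner name
# and substitute its latest version ID for VERSION_ID.
TRAINER = "sakemin/musicgen-fine-tuner:VERSION_ID"

training = replicate.trainings.create(
    version=TRAINER,
    input={
        # Publicly accessible URL of the zip you made above
        "dataset_path": "https://example.com/data.zip",
    },
    destination="my-name/my-model",  # the model you created on Replicate
)
print(training.id, training.status)
```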
If Python isn’t your language of choice, we support several other languages as well.
Or, if you're using the Replicate CLI, you can upload your local dataset and start your training with the `replicate train` command.
These will train the `small` MusicGen model using the default parameters (see details about training parameters below).
To follow the progress of the training job, visit replicate.com/trainings or inspect the training programmatically:
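With the Python client, that inspection might look like this sketch (using the training ID printed when you started the job):

```python
import replicate

training = replicate.trainings.get("TRAINING_ID")  # ID printed when you started the job
print(training.status)  # "starting", "processing", "succeeded", or "failed"
if training.logs:
    print(training.logs[-2000:])  # tail of the training logs
```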
When the model has finished training you can run it on the web, or with an API:
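Here's a sketch with the Python client. The inputs mirror the standard MusicGen parameters (`prompt`, `duration`), but check your model's page on Replicate for the exact schema of your fine-tune; the version ID is a placeholder.

```python
import replicate

output = replicate.run(
    "my-name/my-model:VERSION_ID",  # replace with your fine-tuned model's version
    input={
        "prompt": "sacred chamber choir, choral",
        "duration": 60,  # small/medium fine-tunes can continue past 30 seconds
    },
)
print(output)  # URL (or file output) pointing at the generated audio
```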
If you gave your own descriptions during training, make sure you reuse them in your prompt. Otherwise use a prompt that best describes the new style, or check the training logs to see which labels were automatically added.
For example, in our choral fine-tune, we used the description "sacred chamber choir, choral". Reusing that description in our prompt brings out the style clearly. We’ve also found that using just part of the training description, ‘choir’ or ‘choral’, keeps the style but reduces the strength of the effect.
That's it! You've now got an infinite music generator in your style.
MusicGen fine-tuning comes with parameters that give you control over your trained model (there's an example input after this list):
- `dataset_path`: A URL pointing to your zip or audio file
- `one_same_description`: A description for all audio data (default: none)
- `auto_labeling`: Creates label data such as genre, mood, theme, instrumentation, key, and BPM for each track, using `essentia-tensorflow` for music information retrieval (default: `true`)
- `drop_vocals`: Drops vocal tracks from the audio files in the dataset by separating sources with Demucs (default: `true`)
- `model_version`: The model version to train; choices are `"melody"`, `"small"`, `"medium"` (default: `"small"`)
- `lr`: Learning rate (default: 1)
- `epochs`: Number of epochs to train for (default: 3)
- `updates_per_epoch`: Number of iterations per epoch (default: 100). If set to `None`, the number of iterations is determined automatically from the batch size and dataset.
- `batch_size`: Batch size; must be a multiple of 8 (default: 16)
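Putting a few of those together, a non-default training input might look like this sketch (the trainer identifier and zip URL are the same placeholders as before):

```python
import replicate

training = replicate.trainings.create(
    version="sakemin/musicgen-fine-tuner:VERSION_ID",  # assumed trainer; check Replicate
    input={
        "dataset_path": "https://example.com/data.zip",
        "model_version": "medium",  # train the medium model instead of small
        "one_same_description": "sacred chamber choir, choral",
        "drop_vocals": False,  # keep vocals in the training data
        "epochs": 5,
        "batch_size": 16,  # must be a multiple of 8
    },
    destination="my-name/my-model",
)
print(training.status)
```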