Fine-tune MusicGen to generate music in any style

Posted by @fofr and @sakemin

Fine-tune MusicGen to generate music in a particular style, whether that’s 16-bit video game chiptunes or the calm of a choral piece.

A full model training takes 15 minutes using 8x A40 (Large) hardware. You can run your fine-tuned model from the web or using the cloud API, or you can download the fine-tuned model weights for use in other contexts.

The fine-tune process was developed by Jongmin Jung (aka Sake). It’s based on Meta’s AudioCraft and its built-in trainer, Dora. To make training simple, Sake has included automatic audio chunking, auto-labeling, and vocal removal features. Your trained model can also generate music longer than 30 seconds.

Here is an example of a choral fine-tune combined with a 16-bit video game fine-tune (as a continuation).

Prepare your music dataset

Just a few tracks (9-10) are enough to fine-tune MusicGen on a musical style.

Your tracks must each be longer than 30 seconds. The training script automatically splits long audio files into 30-second chunks for training.
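
If you want to sanity-check your dataset before uploading, here’s a minimal sketch (assuming your files sit in a local tracks/ folder and you have librosa installed) that flags any track that is 30 seconds or shorter:

from pathlib import Path

import librosa  # pip install librosa

MIN_SECONDS = 30
for track in sorted(Path("tracks").glob("*.mp3")):
    audio, sr = librosa.load(str(track), sr=None)     # load at the native sample rate
    duration = librosa.get_duration(y=audio, sr=sr)   # length in seconds
    note = "" if duration > MIN_SECONDS else "  <-- too short to be chunked"
    print(f"{track.name}: {duration:.1f}s{note}")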

Label your training data

You can label your music in three ways:

  1. Let the training script label automatically. Genre, mood, theme, instrumentation, key, and bpm will be generated using essentia and used for each audio file (this is the default).
  2. Give a single description for all tracks using the one_same_description training parameter. This works alongside automatic labeling and will be prepended to each generated label.
  3. Give each audio file your own description.

If you want to use your own descriptions, include a text file with the same filename as each track. For example, 01_A_Man_Without_Love.mp3 would need a text file named 01_A_Man_Without_Love.txt. Put a single-line description inside each text file.
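
If you go the custom-description route, a short sketch like this writes a matching text file next to each track (the filenames and descriptions below are placeholders to replace with your own):

from pathlib import Path

descriptions = {
    "01_A_Man_Without_Love.mp3": "sacred chamber choir, choral",
    "02_My_Other_Track.mp3": "16-bit video game chiptune, upbeat",
}

for filename, description in descriptions.items():
    txt_path = Path("tracks") / Path(filename).with_suffix(".txt")
    txt_path.write_text(description + "\n")  # one single-line description per track
    print(f"wrote {txt_path}")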

Vocals are removed before training

MusicGen does not work well with music containing vocals, and the base models contain no vocals at all. Training on tracks with singing or speech leads to weird-sounding outputs, so by default we strip all vocals from your audio files before training.

If your tracks do not contain vocals, or if you want to try training with vocals anyway, you can disable the feature by setting drop_vocals to false in your training parameters (see below).

Pick a model to train

You can train the small, medium, or melody model. Small is the default. The large model cannot be trained.

If you choose the small or medium models, you can generate music longer than 30 seconds using automatic continuation. The melody model is limited to 30 seconds.

The melody model lets you generate music based on the melody of an input. This feature is only available in your fine-tune if you’ve chosen to train the melody base model.

Add your Replicate API Token

Before starting the training job you need to grab your Replicate API token from replicate.com/account/api-tokens.

In your shell, store that token in an environment variable called REPLICATE_API_TOKEN.

export REPLICATE_API_TOKEN=r8_...

Create a model

You also need to create a model on Replicate that will be the destination for the trained MusicGen version. Go to replicate.com/create to create the model.

In the example below we call it my-name/my-model.
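
If you’d rather script this step, the Replicate Python client can create the model too. Here’s a sketch, assuming my-name and my-model are placeholders for your own username and model name, and that you pick an appropriate hardware SKU:

import replicate

model = replicate.models.create(
    owner="my-name",
    name="my-model",
    visibility="private",       # or "public"
    hardware="gpu-a40-large",   # hardware the trained model will run on
)
print(f"created {model.owner}/{model.name}")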

Upload your training data

Put your tracks (and any text files) in a folder and zip them up.

If you’re using the Replicate CLI, you can upload your training files as part of your training command (see below).

Otherwise you’ll need to upload your zip file somewhere on the internet that is publicly accessible, like an S3 bucket or a GitHub Pages site.

If you like, you can use our API for uploading files. Run these commands:

RESPONSE=$(curl -s -X POST -H "Authorization: Bearer $REPLICATE_API_TOKEN" https://dreambooth-api-experimental.replicate.com/v1/upload/data.zip)

curl -X PUT -H "Content-Type: application/zip" --upload-file data.zip "$(jq -r ".upload_url" <<< "$RESPONSE")"

SERVING_URL=$(jq -r ".serving_url" <<< "$RESPONSE")
echo $SERVING_URL

This will print out a URL to your uploaded zip file. Copy the URL so you can use it as the dataset_path input parameter when starting the training job.
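
The same upload flow works from Python if you’d rather not shell out to curl. Here’s a sketch that mirrors the curl commands above using the requests library:

import os

import requests

token = os.environ["REPLICATE_API_TOKEN"]

# ask the upload endpoint for an upload URL and a serving URL
response = requests.post(
    "https://dreambooth-api-experimental.replicate.com/v1/upload/data.zip",
    headers={"Authorization": f"Bearer {token}"},
).json()

# PUT the zip file to the signed upload URL
with open("data.zip", "rb") as f:
    requests.put(
        response["upload_url"],
        data=f,
        headers={"Content-Type": "application/zip"},
    )

# pass this URL as the dataset_path input when you start the training
print(response["serving_url"])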

Start the training

To train MusicGen, run the following code in Python:

import replicate

training = replicate.trainings.create(
    version="sakemin/musicgen-fine-tuner:8d02c56b9a3d69abd2f1d6cc1a65027de5bfef7f0d34bd23e0624ecabb65acac",
    input={
        "dataset_path": "https://my-domain/my-audio-files.zip",
    },
    destination="my-name/my-model"
)

If Python isn’t your language of choice, we support several other languages as well.

Or if you’re using the Replicate CLI, you can upload your local dataset and start your training with:

replicate train sakemin/musicgen-fine-tuner \
  --destination my-name/my-model \
  dataset_path=@audio.zip

These will train the small MusicGen model using the default parameters (see details about training parameters below).

Monitor training progress

To follow the progress of the training job, visit replicate.com/trainings or inspect the training programmatically:

training.reload()
print(training.status)
print("\n".join(training.logs.split("\n")[-10:]))

Run the model

When the model has finished training you can run it on the web, or with an API:

output = replicate.run(
    "my-name/my-model:abcde1234...",
    input={"prompt": "your new musical style"},
)

If you gave your own descriptions during training, make sure you reuse them in your prompt. Otherwise use a prompt that best describes the new style, or check the training logs to see which labels were automatically added.

For example, in our choral fine-tune, we used the description “sacred chamber choir, choral”. Reusing that description in our prompt brings out the style clearly. We’ve also found that using just part of the training description, ‘choir’ or ‘choral’, keeps the style but reduces the strength of the effect.
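
To keep the result, you can download the generated audio. Here’s a sketch, assuming the output is a URL to a single audio file (or a list containing one); adjust the file extension to match your model’s output format:

import urllib.request

# normalise to a single URL before downloading
audio_url = output[0] if isinstance(output, list) else output
urllib.request.urlretrieve(str(audio_url), "generated_output.wav")
print("saved generated_output.wav")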

That’s it! You’ve now got an infinite music generator in your style.

All fine-tune settings

MusicGen fine-tuning comes with parameters to give you control over your trained model:

  • dataset_path: A URL pointing to your zip or audio file
  • one_same_description: A description for all audio data (default: none)
  • auto_labeling: Creates label data such as genre, mood, theme, instrumentation, key, and bpm for each track, using essentia-tensorflow for music information retrieval (default: true)
  • drop_vocals: Drops vocals from the audio files in the dataset by separating sources with Demucs (default: true)
  • model_version: The model version to train, choices are “melody”, “small”, “medium” (default: “small”)
  • lr: Learning rate (default: 1)
  • epochs: Number of epochs to train for (default: 3)
  • updates_per_epoch: Number of iterations for one epoch (default: 100). If set to None, iterations per epoch will be determined automatically based on batch size and dataset.
  • batch_size: Batch size, must be a multiple of 8 (default: 16)
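
Putting these together, here’s a sketch of a training call that sets several of these parameters explicitly (the dataset URL, destination, and description are placeholders to swap for your own):

import replicate

training = replicate.trainings.create(
    version="sakemin/musicgen-fine-tuner:8d02c56b9a3d69abd2f1d6cc1a65027de5bfef7f0d34bd23e0624ecabb65acac",
    input={
        "dataset_path": "https://my-domain/my-audio-files.zip",
        "one_same_description": "sacred chamber choir, choral",  # optional shared description
        "auto_labeling": True,
        "drop_vocals": True,
        "model_version": "small",   # "small", "medium", or "melody"
        "epochs": 3,
        "batch_size": 16,           # must be a multiple of 8
    },
    destination="my-name/my-model",
)
print(training.id)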

What’s next?