sakemin / musicgen-fine-tuner

Fine-tune the MusicGen small, medium, and melody models. Stereo models are also available.

If you haven’t yet trained a model on Replicate, we recommend reading one of Replicate’s fine-tuning guides first.

Pricing

Trainings for this model run on 8x Nvidia A40 (Large) GPU hardware, which costs $0.0058 per second.
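As a rough guide, the per-second rate can be turned into a cost estimate. The sketch below is only arithmetic on the quoted $0.0058 per second; the helper name estimate_training_cost is hypothetical, and actual billing may differ from this estimate.

```python
# Rate quoted above for 8x Nvidia A40 (Large) hardware, in USD per second.
PRICE_PER_SECOND_USD = 0.0058


def estimate_training_cost(duration_seconds: float) -> float:
    """Return an approximate cost in USD for a training of the given length."""
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    return round(duration_seconds * PRICE_PER_SECOND_USD, 2)


# Example: a 15-minute run (900 seconds).
print(estimate_training_cost(15 * 60))
```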

Create a training

Install the Python library:

pip install replicate

Then, run this to create a training with sakemin/musicgen-fine-tuner:bc57274e as the base model:

import replicate

training = replicate.trainings.create(
  version="sakemin/musicgen-fine-tuner:bc57274e2930af17c1d692516a4e6bd67618af425db3b2107c28c2100f031934",
  input={
    ...
  },
  destination=f"{username}/<destination-model-name>"
)

print(training)

Or, using cURL:

curl -s -X POST \
  -d '{"destination": "{username}/<destination-model-name>", "input": {...}}' \
  -H "Authorization: Bearer $REPLICATE_API_TOKEN" \
  https://api.replicate.com/v1/models/sakemin/musicgen-fine-tuner/versions/bc57274e2930af17c1d692516a4e6bd67618af425db3b2107c28c2100f031934/trainings

The API response will look like this:

{
  "id": "zz4ibbonubfz7carwiefibzgga",
  "version": "bc57274e2930af17c1d692516a4e6bd67618af425db3b2107c28c2100f031934",
  "status": "starting",
  "input": {
    "data": "..."
  },
  "output": null,
  "error": null,
  "logs": null,
  "started_at": null,
  "created_at": "2023-03-28T21:47:58.566434Z",
  "completed_at": null
}
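When polling a training, the status field tells you whether to keep waiting. A minimal sketch, assuming the status values used by Replicate's API (starting, processing, succeeded, failed, canceled); the helper name is_finished is hypothetical, and the sample is a trimmed copy of the response above.

```python
import json

# Statuses after which a training will not change any further (assumed set).
TERMINAL_STATUSES = {"succeeded", "failed", "canceled"}


def is_finished(response: dict) -> bool:
    """Return True once the training has reached a terminal status."""
    return response.get("status") in TERMINAL_STATUSES


# Trimmed copy of the example response shown above.
sample = json.loads("""
{
  "id": "zz4ibbonubfz7carwiefibzgga",
  "status": "starting",
  "output": null,
  "error": null
}
""")

print(sample["status"], is_finished(sample))
```

With the Python client, replicate.trainings.get(training.id) can be used to fetch the latest state before applying a check like this.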

Note that before you can create a training, you’ll need to create a model and use its name as the value for the destination field.

Fine-tuning MusicGen

Dataset

Audio

  • Datasets can be uploaded as compressed archives in .zip, .tar, .gz, or .tgz format.
  • A single audio file in .mp3, .wav, or .flac format can also be uploaded.
  • Each audio file in the dataset must be longer than 5 seconds.
  • Audio chunking: files longer than 30 seconds are split into multiple 30-second chunks.
  • Vocal removal: if drop_vocals is set to True, the vocal tracks are separated from the audio files and removed (default: drop_vocals = True).
    • For datasets that contain no vocals, setting drop_vocals = False shortens preprocessing time and preserves audio quality.
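The accepted upload formats can be checked locally before starting a training. This is a minimal sketch based only on the extensions listed above; the helper name classify_upload is hypothetical, and it does not check durations or archive contents.

```python
from pathlib import Path

# Extensions taken from the list above.
ARCHIVE_EXTENSIONS = {".zip", ".tar", ".gz", ".tgz"}
AUDIO_EXTENSIONS = {".mp3", ".wav", ".flac"}


def classify_upload(filename: str) -> str:
    """Classify a file as 'archive', 'audio', or 'unsupported' by extension."""
    suffix = Path(filename).suffix.lower()
    if suffix in ARCHIVE_EXTENSIONS:
        return "archive"
    if suffix in AUDIO_EXTENSIONS:
        return "audio"
    return "unsupported"


for name in ("dataset.zip", "songs.tar.gz", "track.flac", "notes.txt"):
    print(name, classify_upload(name))
```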

Text Description

  • If each audio file needs its own description, create a .txt file containing a single-line description for each .mp3 or .wav file, using the same base name (e.g. 01_A_Man_Without_Love.mp3 and 01_A_Man_Without_Love.txt).
  • To use one description for all audio files, set the one_same_description argument to the desired description (str). In this case, individual .txt files are not needed.
  • Auto labeling: when auto_labeling is set to True, labels such as ‘genre’, ‘mood’, ‘theme’, ‘instrumentation’, ‘key’, and ‘bpm’ are generated and added for each audio file in the dataset (default: auto_labeling = True).
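When using per-file descriptions, it can help to confirm that every audio file has its matching .txt file before archiving the dataset. A minimal sketch, assuming the description file sits next to the audio file and shares its base name; the helper name missing_descriptions is hypothetical.

```python
from pathlib import Path
from typing import List, Tuple


def missing_descriptions(
    dataset_dir: str,
    audio_extensions: Tuple[str, ...] = (".mp3", ".wav"),
) -> List[str]:
    """Return the names of audio files that have no sibling .txt description."""
    missing = []
    for path in sorted(Path(dataset_dir).rglob("*")):
        if path.is_file() and path.suffix.lower() in audio_extensions:
            # e.g. 01_A_Man_Without_Love.mp3 -> 01_A_Man_Without_Love.txt
            if not path.with_suffix(".txt").is_file():
                missing.append(path.name)
    return missing
```

Calling missing_descriptions("path/to/dataset") returns an empty list when every audio file is paired with a description.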

Train Parameters

Train Inputs

  • dataset_path: Path = Input(description="Path to dataset directory")
  • one_same_description: str = Input(description="A description for all of audio data", default=None)
  • auto_labeling: bool = Input(description="Creating label data like genre, mood, theme, instrumentation, key, bpm for each track. Using essentia-tensorflow for music information retrieval.", default=True)
  • drop_vocals: bool = Input(description="Dropping the vocal tracks from the audio files in dataset, by separating sources with Demucs.", default=True)
  • model_version: str = Input(description="Model version to train.", default="stereo-melody", choices=["melody", "small", "medium", "stereo-melody", "stereo-small", "stereo-medium"])
  • lr: float = Input(description="Learning rate", default=1)
  • epochs: int = Input(description="Number of epochs to train for", default=3)
  • updates_per_epoch: int = Input(description="Number of iterations for one epoch", default=100). If set to None, the number of iterations per epoch is derived from the dataset size and batch size; otherwise the given value is used.
  • batch_size: int = Input(description="Batch size", default=16)
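The inputs above can be assembled into the input dictionary passed to a training. A minimal sketch using the defaults and choices listed above; the helper name build_training_input is hypothetical and not part of the replicate library, and it validates only model_version.

```python
from typing import Optional

# Choices listed for model_version above.
MODEL_VERSIONS = (
    "melody", "small", "medium",
    "stereo-melody", "stereo-small", "stereo-medium",
)


def build_training_input(
    dataset_path: str,
    model_version: str = "stereo-melody",
    one_same_description: Optional[str] = None,
    auto_labeling: bool = True,
    drop_vocals: bool = True,
    lr: float = 1,
    epochs: int = 3,
    updates_per_epoch: int = 100,
    batch_size: int = 16,
) -> dict:
    """Build the input dict for a training, rejecting unknown model versions."""
    if model_version not in MODEL_VERSIONS:
        raise ValueError(f"model_version must be one of {MODEL_VERSIONS}")
    training_input = {
        "dataset_path": dataset_path,
        "model_version": model_version,
        "auto_labeling": auto_labeling,
        "drop_vocals": drop_vocals,
        "lr": lr,
        "epochs": epochs,
        "updates_per_epoch": updates_per_epoch,
        "batch_size": batch_size,
    }
    # Only send a shared description when one was actually provided.
    if one_same_description is not None:
        training_input["one_same_description"] = one_same_description
    return training_input
```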

Default Parameters

  • With epochs=3, updates_per_epoch=100, and lr=1, fine-tuning takes around 15 minutes.
  • Setting epochs=5, updates_per_epoch=1000, and lr=0.0001 tends to work better and helps prevent overfitting, but takes more computation time.
  • For 8-GPU multiprocessing, batch_size must be a multiple of 8. If it is not, batch_size is automatically rounded down to the nearest multiple of 8.
  • For the medium model, the maximum batch_size is 8 on the 8x Nvidia A40 machine.
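The batch size rounding and the total number of updates follow from simple arithmetic. A minimal sketch, assuming rounding down means integer flooring to a multiple of the GPU count; the helper names are hypothetical, and how the trainer treats a batch_size below 8 is not stated here.

```python
def effective_batch_size(batch_size: int, num_gpus: int = 8) -> int:
    """Round batch_size down to the nearest multiple of num_gpus."""
    # Note: values smaller than num_gpus yield 0 in this sketch.
    return (batch_size // num_gpus) * num_gpus


def total_updates(epochs: int, updates_per_epoch: int) -> int:
    """Total number of training iterations across all epochs."""
    return epochs * updates_per_epoch


print(effective_batch_size(20))   # 20 is not a multiple of 8
print(total_updates(3, 100))      # the default settings
print(total_updates(5, 1000))     # the longer settings suggested above
```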

Example Code

import replicate

training = replicate.trainings.create(
  version="sakemin/musicgen-fine-tuner:b1ec6490e57013463006e928abc7acd8d623fe3e8321d3092e1231bf006898b1",
  input={
    "dataset_path":"https://your/data/path.zip",
    "one_same_description":"description for your dataset music",
    "epochs":3,
    "updates_per_epoch":100,
    "model_version":"medium",
  },
  destination="my-name/my-model"
)

print(training)