sakemin / musicgen-fine-tuner

Fine-tune MusicGen small, medium, and melody models. Stereo models are also available.

Public
8.8K runs

If you haven’t yet trained a model on Replicate, start with one of Replicate’s fine-tuning guides.

Pricing

Trainings for this model run on 8x Nvidia A100 (80GB) GPU hardware, which costs $0.0112 per second.
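As a rough illustration of what this rate means in practice, the sketch below multiplies a run's duration by the per-second price quoted above. The function name and the 15-minute example duration are illustrative assumptions (the README further down quotes about 15 minutes for the default settings); the actual bill depends on the real runtime.

```python
# Rate quoted above for 8x Nvidia A100 (80GB) training hardware.
PRICE_PER_SECOND_USD = 0.0112


def estimate_training_cost(duration_seconds: float,
                           price_per_second: float = PRICE_PER_SECOND_USD) -> float:
    """Return the approximate cost in USD for a run of the given length."""
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    return duration_seconds * price_per_second


if __name__ == "__main__":
    # Hypothetical 15-minute run (900 seconds).
    print(f"${estimate_training_cost(15 * 60):.2f}")
```

Under these assumptions, a 15-minute run comes to roughly $10.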

Create a training

Install the Python library:

pip install replicate

Then, run this to create a training with sakemin/musicgen-fine-tuner:bc57274e as the base model:

import replicate

training = replicate.trainings.create(
  version="sakemin/musicgen-fine-tuner:bc57274e2930af17c1d692516a4e6bd67618af425db3b2107c28c2100f031934",
  input={
    ...
  },
  destination="{username}/<destination-model-name>"  # replace with your own username and model name
)

print(training)

Or, with cURL:

curl -s -X POST \
  -H "Authorization: Bearer $REPLICATE_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"destination": "{username}/<destination-model-name>", "input": {...}}' \
  https://api.replicate.com/v1/models/sakemin/musicgen-fine-tuner/versions/bc57274e2930af17c1d692516a4e6bd67618af425db3b2107c28c2100f031934/trainings

The API response will look like this:

{
  "id": "zz4ibbonubfz7carwiefibzgga",
  "version": "bc57274e2930af17c1d692516a4e6bd67618af425db3b2107c28c2100f031934",
  "status": "starting",
  "input": {
    "data": "..."
  },
  "output": null,
  "error": null,
  "logs": null,
  "started_at": null,
  "created_at": "2023-03-28T21:47:58.566434Z",
  "completed_at": null
}
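The "status" field in the response above starts at "starting" and changes as the training progresses, so a client usually polls until it reaches a final value. The following is a minimal sketch of such a loop. The set of terminal statuses is an assumption based on Replicate's API documentation, and `wait_for_training` and `fetch_status` are hypothetical names; the status lookup is passed in as a callable so the loop does not depend on any particular client.

```python
import time
from typing import Callable

# Statuses after which a training no longer changes
# (assumed from Replicate's API documentation).
TERMINAL_STATUSES = {"succeeded", "failed", "canceled"}


def is_finished(status: str) -> bool:
    """True once the training has reached a final state."""
    return status in TERMINAL_STATUSES


def wait_for_training(fetch_status: Callable[[], str],
                      poll_seconds: float = 10.0,
                      max_polls: int = 360) -> str:
    """Call fetch_status until it returns a terminal status or polls run out.

    Returns the last status seen, which may be non-terminal on timeout.
    """
    status = fetch_status()
    for _ in range(max_polls):
        if is_finished(status):
            break
        time.sleep(poll_seconds)
        status = fetch_status()
    return status
```

With the Python client, `fetch_status` could plausibly be something like `lambda: replicate.trainings.get(training.id).status`; treat that call as an assumption and check it against the client version you have installed.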

Note that before you can create a training, you’ll need to create a model and use its name as the value for the destination field.

Fine-tuning MusicGen

Dataset

Audio

Datasets can be uploaded as compressed archives in formats such as .zip, .tar, .gz, and .tgz.
A single audio file in .mp3, .wav, or .flac format can also be uploaded.
Each audio file in the dataset must be longer than 5 seconds.
Audio Chunking: Files longer than 30 seconds are split into multiple 30-second chunks.
Vocal Removal: If drop_vocals is set to True, the vocal tracks in the audio files are separated out and removed. (Default: drop_vocals = True)
- For datasets whose audio contains no vocals, setting drop_vocals = False shortens preprocessing and preserves audio quality.
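Before uploading, it can help to check file durations against the rules above. The sketch below is a local pre-check only and is not part of the fine-tuner: it reads .wav durations with the standard library and labels each one using the 5-second minimum and 30-second chunk length stated above. The function names and the three labels are hypothetical; .mp3 and .flac files would need a third-party decoder, which is left out here.

```python
import wave

MIN_SECONDS = 5.0     # files must be longer than this
CHUNK_SECONDS = 30.0  # longer files are split into chunks of this length


def wav_duration_seconds(path: str) -> float:
    """Duration of an uncompressed .wav file, using only the standard library."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def classify_duration(seconds: float) -> str:
    """Label a duration according to the dataset rules described above."""
    if seconds <= MIN_SECONDS:
        return "too_short"
    if seconds > CHUNK_SECONDS:
        return "will_be_chunked"
    return "ok"
```

For example, `classify_duration(wav_duration_seconds("track.wav"))` would flag a clip that is too short to be used, where "track.wav" stands in for one of your own files.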

Text Description

If each audio file needs its own description, create a .txt file containing a single-line description for each .mp3 or .wav file, using the same base name (e.g., 01_A_Man_Without_Love.mp3 and 01_A_Man_Without_Love.txt).
To use one description for all audio files, set the one_same_description argument (str) to that description. In this case, individual .txt files are not needed.
Auto Labeling: When auto_labeling is set to True, labels such as ‘genre’, ‘mood’, ‘theme’, ‘instrumentation’, ‘key’, and ‘bpm’ are generated and added to each audio file in the dataset. (Default: auto_labeling = True)
- If you intend to use only the labels from auto_labeling, set one_same_description="".
- Available Tags of Auto-Labeling
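The per-file description convention above can be sketched as a small packaging step. The helper below is a hypothetical illustration, not part of the fine-tuner: it zips every .mp3 or .wav file in a folder together with its same-named .txt file and reports which audio files have no description. It assumes a flat archive layout is acceptable, which the text above does not state explicitly.

```python
import zipfile
from pathlib import Path

AUDIO_SUFFIXES = {".mp3", ".wav"}


def build_dataset_zip(audio_dir: str, zip_path: str) -> list[str]:
    """Zip each audio file with its same-named .txt description, if present.

    Returns the names of audio files that have no matching .txt file.
    """
    missing = []
    with zipfile.ZipFile(zip_path, "w") as archive:
        for audio in sorted(Path(audio_dir).iterdir()):
            if audio.suffix.lower() not in AUDIO_SUFFIXES:
                continue
            archive.write(audio, arcname=audio.name)
            description = audio.with_suffix(".txt")
            if description.is_file():
                archive.write(description, arcname=description.name)
            else:
                missing.append(audio.name)
    return missing
```

Files returned in the list would rely on one_same_description or auto_labeling for their text instead.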

Train Parameters

Train Inputs

dataset_path: Path = Input(description="Path to dataset directory")
one_same_description: str = Input(description="A description for all of audio data", default=None)
auto_labeling: bool = Input(description="Creating label data like genre, mood, theme, instrumentation, key, bpm for each track. Using essentia-tensorflow for music information retrieval.", default=True)
drop_vocals: bool = Input(description="Dropping the vocal tracks from the audio files in dataset, by separating sources with Demucs.", default=True)
model_version: str = Input(description="Model version to train.", default="stereo-melody", choices=["melody", "small", "medium", "stereo-melody", "stereo-small", "stereo-medium"])
lr: float = Input(description="Learning rate", default=1)
epochs: int = Input(description="Number of epochs to train for", default=3)
updates_per_epoch: int = Input(description="Number of iterations for one epoch", default=100)
- If None, the number of iterations per epoch is derived from the dataset size and batch size. Otherwise, the given value is used as the number of iterations per epoch.
batch_size: int = Input(description="Batch size", default=16)
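One way to avoid typos in these inputs is to assemble the input dictionary with a small helper that checks values locally before the training is created. The sketch below is a hypothetical convenience wrapper, not part of the fine-tuner or the Replicate client; it copies the defaults and the model_version choices listed above and omits one_same_description when no description is given.

```python
from typing import Optional

# Choices and defaults copied from the input list above.
MODEL_VERSIONS = ["melody", "small", "medium",
                  "stereo-melody", "stereo-small", "stereo-medium"]


def build_training_input(dataset_path: str,
                         one_same_description: Optional[str] = None,
                         auto_labeling: bool = True,
                         drop_vocals: bool = True,
                         model_version: str = "stereo-melody",
                         lr: float = 1,
                         epochs: int = 3,
                         updates_per_epoch: int = 100,
                         batch_size: int = 16) -> dict:
    """Return a dict suitable for the `input` argument of a training request."""
    if model_version not in MODEL_VERSIONS:
        raise ValueError(f"model_version must be one of {MODEL_VERSIONS}")
    if epochs < 1 or updates_per_epoch < 1 or batch_size < 1:
        raise ValueError("epochs, updates_per_epoch and batch_size must be positive")
    params = {
        "dataset_path": dataset_path,
        "auto_labeling": auto_labeling,
        "drop_vocals": drop_vocals,
        "model_version": model_version,
        "lr": lr,
        "epochs": epochs,
        "updates_per_epoch": updates_per_epoch,
        "batch_size": batch_size,
    }
    if one_same_description is not None:
        params["one_same_description"] = one_same_description
    return params
```

The returned dictionary can then be passed as `input=` to `replicate.trainings.create`.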

Default Parameters

With epochs=3, updates_per_epoch=100 and lr=1, it takes around 15 minutes to fine-tune the model.
Setting epochs=5, updates_per_epoch=1000 and lr=0.0001 works better and helps prevent overfitting, but takes more computation time.
For 8-GPU multiprocessing, batch_size must be a multiple of 8. If it is not, batch_size is automatically rounded down to the nearest multiple of 8.
For the medium model, the maximum batch_size is 8 on the 8x Nvidia A40 machine setting.
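The arithmetic behind these settings can be sketched in a few lines. The helpers below are illustrative only: the first mirrors the stated round-down of batch_size to a multiple of 8, and the second counts total optimizer updates as epochs times updates_per_epoch. How the trainer treats a batch_size below 8 is not stated, so the sketch simply returns 0 in that case.

```python
def effective_batch_size(batch_size: int, num_gpus: int = 8) -> int:
    """Round batch_size down to the nearest multiple of num_gpus."""
    return (batch_size // num_gpus) * num_gpus


def total_updates(epochs: int, updates_per_epoch: int) -> int:
    """Total number of training iterations across all epochs."""
    return epochs * updates_per_epoch


if __name__ == "__main__":
    print(effective_batch_size(20))   # a batch_size of 20 is rounded down to 16
    print(total_updates(3, 100))      # default settings
    print(total_updates(5, 1000))     # the longer configuration suggested above
```

By this count the default settings run 300 updates, while the longer configuration runs 5000, which accounts for the extra computation time.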

Example Code

import replicate

training = replicate.trainings.create(
  version="sakemin/musicgen-fine-tuner:b1ec6490e57013463006e928abc7acd8d623fe3e8321d3092e1231bf006898b1",
  input={
    "dataset_path":"https://your/data/path.zip",
    "one_same_description":"description for your dataset music",
    "epochs":3,
    "updates_per_epoch":100,
    "model_version":"medium",
  },
  destination="my-name/my-model"
)

print(training)