sakemin/musicgen-fine-tuner
Fine-tune MusicGen small, medium, and melody models; stereo variants are also available.
If you haven’t yet trained a model on Replicate, we recommend reading Replicate’s guides on training and fine-tuning first.
Pricing
Trainings for this model run on 8x Nvidia A40 (Large) GPU hardware, which costs $0.0058 per second. For example, the roughly 15-minute fine-tune described under Default Parameters below costs about $5.22 (900 seconds × $0.0058).
Create a training
Note that before you can create a training, you’ll need to create a model and use its name as the value for the destination field.
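For reference, here’s a minimal sketch of creating that destination model with the Replicate Python client. It assumes the replicate.models.create call and the hardware identifier shown; the owner and name values are placeholders, so check Replicate’s docs for the exact options.

import replicate

model = replicate.models.create(
    owner="my-name",           # your Replicate username (placeholder)
    name="my-model",           # becomes the "destination" value below (placeholder)
    visibility="private",
    hardware="gpu-a40-large",  # hardware the fine-tuned model will run on
)
print(f"{model.owner}/{model.name}")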
Fine-tuning MusicGen
Dataset
Audio
- Compressed files in formats like .zip, .tar, .gz, and .tgz are compatible for dataset uploads (a packaging sketch follows this list).
- Single audio files in .mp3, .wav, and .flac formats can also be uploaded.
- Audio files within the dataset must exceed 5 seconds in duration.
- Audio Chunking: Files surpassing 30 seconds will be divided into multiple 30-second chunks.
- Vocal Removal: If drop_vocals is set to True, the vocal tracks in the audio files will be isolated and removed. (Default: drop_vocals = True)
- For datasets containing audio without vocals, setting drop_vocals = False reduces data preprocessing time and maintains audio file quality.
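Here’s a minimal packaging sketch (not part of the trainer) that zips a local folder of tracks for upload. The my_tracks folder name is a placeholder, and the duration rules above are enforced during preprocessing, not by this script.

import zipfile
from pathlib import Path

AUDIO_EXTS = {".mp3", ".wav", ".flac"}

with zipfile.ZipFile("dataset.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for f in sorted(Path("my_tracks").iterdir()):
        if f.suffix.lower() in AUDIO_EXTS:
            # Each clip must exceed 5 seconds; clips over 30 seconds are
            # split into 30-second chunks during preprocessing.
            zf.write(f, arcname=f.name)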
Text Description
- If each audio file requires a distinct description, create a .txt file with a single-line description corresponding to each .mp3 or .wav file (e.g. 01_A_Man_Without_Love.mp3 and 01_A_Man_Without_Love.txt; a sketch of this pairing follows the list).
- For a uniform description across all audio files, set the one_same_description argument to your desired description (str). In this case, there’s no need for individual .txt files.
- Auto Labeling: When auto_labeling is set to True, labels such as ‘genre’, ‘mood’, ‘theme’, ‘instrumentation’, ‘key’, and ‘bpm’ will be generated and added to each audio file in the dataset. (Default: auto_labeling = True)
- If you intend to use only the labels from auto_labeling, set one_same_description="".
- Available Tags of Auto-Labeling
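Here’s a small sketch of the .mp3/.txt pairing convention; the descriptions dict is hypothetical example data.

from pathlib import Path

descriptions = {
    "01_A_Man_Without_Love.mp3": "a nostalgic 60s pop ballad with lush strings",
}

for audio_name, text in descriptions.items():
    # Write a single-line .txt with the same base name as the audio file.
    Path(audio_name).with_suffix(".txt").write_text(text + "\n")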
Train Parameters
Train Inputs
- dataset_path: Path = Input(description="Path to dataset directory")
- one_same_description: str = Input(description="A description for all of audio data", default=None)
- auto_labeling: bool = Input(description="Creating label data like genre, mood, theme, instrumentation, key, bpm for each track. Using essentia-tensorflow for music information retrieval.", default=True)
- drop_vocals: bool = Input(description="Dropping the vocal tracks from the audio files in dataset, by separating sources with Demucs.", default=True)
- model_version: str = Input(description="Model version to train.", default="stereo-melody", choices=["melody", "small", "medium", "stereo-melody", "stereo-small", "stereo-medium"])
- lr: float = Input(description="Learning rate", default=1)
- epochs: int = Input(description="Number of epochs to train for", default=3)
- updates_per_epoch: int = Input(description="Number of iterations for one epoch", default=100). If None, the number of iterations per epoch is derived from the dataset and batch size; otherwise the given value is used.
- batch_size: int = Input(description="Batch size", default=16)
Default Parameters
- With epochs=3, updates_per_epoch=100 and lr=1, it takes around 15 minutes to fine-tune the model.
- Setting epochs=5, updates_per_epoch=1000 and lr=0.0001 works better at preventing overfitting, but takes more computation time.
- For 8-GPU multiprocessing, batch_size must be a multiple of 8. If not, batch_size will be automatically floored to the nearest multiple of 8 (a short illustration follows this list).
- For the medium model, the maximum batch_size is 8 with the 8x Nvidia A40 machine setting.
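To make the flooring rule concrete, here’s an illustration (not the trainer’s actual code; how values below 8 are handled isn’t documented here):

def floored_batch_size(batch_size: int) -> int:
    # Floor to the nearest multiple of 8, as described above.
    return (batch_size // 8) * 8

print(floored_batch_size(16))  # 16 (already a multiple of 8)
print(floored_batch_size(12))  # 8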
Example Code
import replicate

# "destination" must point to a model you have already created
# (see "Create a training" above).
training = replicate.trainings.create(
    version="sakemin/musicgen-fine-tuner:b1ec6490e57013463006e928abc7acd8d623fe3e8321d3092e1231bf006898b1",
    input={
        "dataset_path": "https://your/data/path.zip",
        "one_same_description": "description for your dataset music",
        "epochs": 3,
        "updates_per_epoch": 100,
        "model_version": "medium",
    },
    destination="my-name/my-model",
)
print(training)
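Once the training is created, you can poll it until it finishes. This follow-up sketch assumes the client’s reload() method refreshes the training’s state; if your client version differs, re-fetch with replicate.trainings.get(training.id) instead.

import time

while training.status not in {"succeeded", "failed", "canceled"}:
    time.sleep(30)      # poll every 30 seconds
    training.reload()   # refresh status from the API

print(training.status)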