MusicGen with Fine-tuner
MusicGen is a simple and controllable model for music generation. With the fine-tuner implemented in this repository, users can fine-tune MusicGen on their own datasets.
- AudioCraft 1.2.0 is now supported. (Stereo models added.)
MusicGen fine-tuning instruction blog post
Fine-tune MusicGen to generate music in any style
Model Architecture and Development
MusicGen is a single-stage auto-regressive Transformer model trained over a 32 kHz EnCodec tokenizer with 4 codebooks sampled at 50 Hz. Unlike existing methods such as MusicLM, MusicGen doesn't require a self-supervised semantic representation, and it generates all 4 codebooks in one pass. By introducing a small delay between the codebooks, the authors show that the codebooks can be predicted in parallel, which leaves only 50 auto-regressive steps per second of audio. MusicGen was trained on 20K hours of licensed music: an internal dataset of 10K high-quality music tracks, plus the ShutterStock and Pond5 music data.
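The effect of the codebook delay on the step count can be illustrated with a small sketch. This is not AudioCraft's implementation; it assumes that codebook k is shifted by k steps, and `PAD`, `apply_delay_pattern` and `steps_for` are hypothetical names used only for illustration.

```python
PAD = -1  # hypothetical marker for positions that hold no token yet


def apply_delay_pattern(codes):
    """Shift codebook k right by k steps (assumed delay of one step per codebook).

    codes: list of K lists, each holding T token ids (one list per codebook).
    Returns K lists of length T + K - 1, padded with PAD.
    """
    num_codebooks = len(codes)
    num_steps = len(codes[0])
    total = num_steps + num_codebooks - 1
    delayed = []
    for k, row in enumerate(codes):
        delayed.append([PAD] * k + list(row) + [PAD] * (total - num_steps - k))
    return delayed


def steps_for(seconds, frame_rate=50, num_codebooks=4):
    """Compare decoding steps: one token per step vs. the delayed layout."""
    flattened = seconds * frame_rate * num_codebooks      # every token is its own step
    delayed = seconds * frame_rate + num_codebooks - 1    # one step per frame, plus the delay tail
    return flattened, delayed
```

Under these assumptions, one second of audio needs 200 steps if the 4 codebooks are flattened into a single sequence, but only 50 steps plus a constant tail of 3 with the delayed layout.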
Prediction
Default Model
- The default prediction model is configured as the `melody` model.
- After completing the fine-tuning process from this repository, the trained model weights will be loaded into your own model repository.
Infinite Generation
- You can set `duration` to more than 30 seconds.
- Because MusicGen can generate at most 30 seconds of audio in one iteration, a duration longer than 30 seconds is produced as multiple sequences. Each generation step uses the latter portion of the previous step's output as its audio prompt (the same method as continuation).
- Infinite generation works with 1) `input_audio=None`, 2) `input_audio` with `continuation=True`, and 3) `input_audio` longer than `duration` used as melody condition audio, which means `continuation=False`.
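The windowed generation described above can be sketched as a simple planning function. This is an illustration only: the length of the audio prompt carried over between steps is not stated here, so `prompt_seconds=10` is an assumed value, and `plan_segments` is a hypothetical helper, not a function from this repository.

```python
def plan_segments(duration, max_segment=30, prompt_seconds=10):
    """Return (start, end) times in seconds for each generation step.

    Every step after the first re-uses the last `prompt_seconds` of the
    previous output as its audio prompt, so it contributes at most
    `max_segment - prompt_seconds` seconds of new audio.
    """
    if duration <= 0:
        raise ValueError("duration must be positive")
    if not 0 < prompt_seconds < max_segment:
        raise ValueError("prompt_seconds must be between 0 and max_segment")

    segments = [(0, min(duration, max_segment))]
    while segments[-1][1] < duration:
        start = segments[-1][1] - prompt_seconds   # overlap with the previous output
        end = min(duration, start + max_segment)
        segments.append((start, end))
    return segments
```

With these assumed values, a 70-second request is planned as three overlapping windows, while a 20-second request needs a single step.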
Fine-tuning MusicGen
For instructions on fine-tuning MusicGen, see the blog post: Fine-tune MusicGen to generate music in any style
Dataset
Audio
- Compressed files in formats like .zip, .tar, .gz, and .tgz are compatible for dataset uploads.
- Single audio files with .mp3, .wav, and .flac formats can also be uploaded.
- Audio files within the dataset must exceed 30 seconds in duration.
- Audio Chunking : Files surpassing 30 seconds will be divided into multiple 30-second chunks.
- Vocal Removal : If `drop_vocals` is set to `True`, the vocal tracks in the audio files will be isolated and removed. (Default : `drop_vocals = True`)
  - For datasets containing audio without vocals, setting `drop_vocals = False` reduces data preprocessing time and maintains audio file quality.
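The chunking rule can be sketched as a computation of sample offsets. This is a minimal illustration, not the repository's preprocessing code; `chunk_bounds` is a hypothetical helper, and how a trailing piece shorter than 30 seconds is handled is an assumption exposed through the `keep_remainder` flag.

```python
def chunk_bounds(num_samples, sample_rate, chunk_seconds=30, keep_remainder=False):
    """Return (start, end) sample offsets for consecutive fixed-length chunks.

    Only full chunks are returned by default. With keep_remainder=True the
    shorter trailing piece is appended as well (assumed behaviour).
    """
    chunk_len = chunk_seconds * sample_rate
    bounds = [
        (start, start + chunk_len)
        for start in range(0, num_samples - chunk_len + 1, chunk_len)
    ]
    remainder_start = len(bounds) * chunk_len
    if keep_remainder and remainder_start < num_samples:
        bounds.append((remainder_start, num_samples))
    return bounds
```

For a 75-second file at 32 kHz this yields two full 30-second chunks, plus a 15-second remainder only when `keep_remainder=True`.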
Text Description
- If each audio file requires a distinct description, create a .txt file with a single-line description corresponding to each .mp3 or .wav file. (e.g. `01_A_Man_Without_Love.mp3` and `01_A_Man_Without_Love.txt`)
- For a uniform description across all audio files, set the `one_same_description` argument to your desired description (`str`). In this case, there's no need for individual .txt files.
- Auto Labeling : When `auto_labeling` is set to `True`, labels such as 'genre', 'mood', 'theme', 'instrumentation', 'key', and 'bpm' will be generated and added to each audio file in the dataset. (Default : `auto_labeling = True`)
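Preparing per-file descriptions can be sketched with the standard library. The helper names `write_descriptions` and `zip_dataset` are hypothetical, and the flat archive layout (all files at the top level of the .zip) is an assumption rather than a documented requirement.

```python
import zipfile
from pathlib import Path

AUDIO_EXTS = {".mp3", ".wav"}  # extensions named for per-file descriptions


def write_descriptions(dataset_dir, descriptions):
    """Write one single-line .txt file next to each described audio file.

    descriptions: dict mapping audio file name -> description string.
    Returns the list of .txt paths that were written.
    """
    written = []
    for audio in sorted(Path(dataset_dir).iterdir()):
        if audio.suffix.lower() not in AUDIO_EXTS or audio.name not in descriptions:
            continue
        # Collapse any line breaks so the description stays on a single line.
        text = " ".join(descriptions[audio.name].split())
        txt = audio.with_suffix(".txt")
        txt.write_text(text + "\n", encoding="utf-8")
        written.append(txt)
    return written


def zip_dataset(dataset_dir, zip_path):
    """Pack every file in dataset_dir into a flat .zip (keep zip_path outside dataset_dir)."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for f in sorted(Path(dataset_dir).iterdir()):
            if f.is_file():
                zf.write(f, arcname=f.name)
    return Path(zip_path)
```

The resulting archive would then be hosted somewhere reachable and passed as `dataset_path`.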
Train Parameters
Train Inputs
- `dataset_path`: Path = Input("Path to dataset directory",)
- `one_same_description`: str = Input(description="A description for all of audio data", default=None)
- `auto_labeling`: bool = Input(description="Creating label data like genre, mood, theme, instrumentation, key, bpm for each track. Using `essentia-tensorflow` for music information retrieval.", default=True)
- `drop_vocals`: bool = Input(description="Dropping the vocal tracks from the audio files in dataset, by separating sources with Demucs.", default=True)
- `model_version`: str = Input(description="Model version to train.", default="stereo-melody", choices=["melody", "small", "medium", "stereo-melody", "stereo-small", "stereo-medium"])
- `lr`: float = Input(description="Learning rate", default=1)
- `epochs`: int = Input(description="Number of epochs to train for", default=3)
- `updates_per_epoch`: int = Input(description="Number of iterations for one epoch", default=100). If None, the number of iterations per epoch is set according to the dataset and batch size. If a value is given, that value is used as the number of iterations per epoch.
- `batch_size`: int = Input(description="Batch size", default=16)
Default Parameters
- With `epochs=3`, `updates_per_epoch=100` and `lr=1`, it takes around 15 minutes to fine-tune the model.
- For 8-GPU multiprocessing, `batch_size` must be a multiple of 8. If not, `batch_size` will be automatically floored to the nearest multiple of 8.
- For the `medium` model, the maximum `batch_size` is `8` on an 8 x Nvidia A40 machine.
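The batch size flooring rule amounts to one integer division. The sketch below is illustrative only; `effective_batch_size` is a hypothetical name, and raising an error for values below the GPU count is an assumption, since the behaviour in that case is not described here.

```python
def effective_batch_size(batch_size, num_gpus=8):
    """Floor batch_size to the nearest multiple of num_gpus."""
    floored = (batch_size // num_gpus) * num_gpus
    if floored == 0:
        # Assumed handling: the source does not say what happens below num_gpus.
        raise ValueError(f"batch_size must be at least {num_gpus}")
    return floored
```

For example, a requested `batch_size` of 20 would be trained with 16, and the default of 16 is left unchanged.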
Example Code
import replicate

training = replicate.trainings.create(
    version="sakemin/musicgen-fine-tuner:b1ec6490e57013463006e928abc7acd8d623fe3e8321d3092e1231bf006898b1",
    input={
        "dataset_path": "https://your/data/path.zip",
        "one_same_description": "description for your dataset music",
        "epochs": 3,
        "updates_per_epoch": 100,
        "model_version": "medium",
    },
    destination="my-name/my-model",
)

print(training)
References
- Auto-labeling and audio chunking features are based on lyramakesmusic's Finetune-MusicGen jupyter notebook.
- The auto-labeling feature utilizes `effnet-discogs` from MTG's `essentia`.
- 'key' and 'bpm' values are obtained using `librosa`.
- Vocal dropping is implemented using Meta's `demucs`.
Licenses
- All code in this repository is licensed under the Apache License 2.0.
- The code in the Audiocraft repository is released under the MIT license as found in the LICENSE file.
- The weights in the Audiocraft repository are released under the CC-BY-NC 4.0 license as found in the LICENSE_weights file.