sakemin / musicgen-fine-tuner

Fine-tune MusicGen small, medium and melody models. Stereo models are also available.

  • Public
  • 7.9K runs
  • L40S
  • GitHub
  • License

Input

string

A description of the music you want to generate.

file

An audio file that will influence the generated music. If `continuation` is `True`, the generated music will be a continuation of the audio file. Otherwise, the generated music will mimic the audio file's melody.

integer

Duration of the generated audio in seconds.

Default: 8

boolean

If `True`, generated music will continue from `input_audio`. Otherwise, generated music will mimic `input_audio`'s melody.

Default: false

integer
(minimum: 0)

Start time of the audio file to use for continuation.

Default: 0

integer
(minimum: 0)

End time of the audio file to use for continuation. If -1 or None, will default to the end of the audio clip.

boolean

If `True`, the EnCodec tokens will be decoded with MultiBand Diffusion. Only works with non-stereo models.

Default: false

string

Strategy for normalizing audio.

Default: "loudness"

integer

Reduces sampling to the k most likely tokens.

Default: 250

number

Reduces sampling to tokens with cumulative probability of p. When set to `0` (default), top_k sampling is used.

Default: 0

number

Controls the 'conservativeness' of the sampling process. Higher temperature means more diversity.

Default: 1

integer

Increases the influence of inputs on the output. Higher values produce lower-variance outputs that adhere more closely to inputs.

Default: 3

string

Output format for generated audio.

Default: "wav"

integer

Seed for random number generator. If None or -1, a random seed will be used.
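To illustrate these inputs end to end, here is a minimal prediction sketch using the replicate Python client. The version hash is the one shown on this page, but the input key names (prompt, duration, top_k, temperature, output_format, seed) are assumptions based on the standard MusicGen schema; check this page's API tab for the exact names.

import replicate

# Minimal prediction sketch; input keys assumed from the standard MusicGen schema.
output = replicate.run(
    "sakemin/musicgen-fine-tuner:b1ec6490e57013463006e928abc7acd8d623fe3e8321d3092e1231bf006898b1",
    input={
        "prompt": "lo-fi hip hop beat with mellow piano",  # text description
        "duration": 30,          # seconds; values over 30 trigger chunked generation
        "top_k": 250,            # defaults shown in the list above
        "temperature": 1,
        "output_format": "wav",
        "seed": -1,              # -1 picks a random seed
    },
)
print(output)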

Output


This output was created using a different version of the model, sakemin/musicgen-fine-tuner:b1ec6490.

Run time and cost

This model costs approximately $0.11 to run on Replicate, or 9 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.

This model runs on Nvidia L40S GPU hardware. Predictions typically complete within 117 seconds. The predict time for this model varies significantly based on the inputs.

Readme

MusicGen with Fine-tuner

MusicGen is a simple and controllable model for music generation. This repository adds a fine-tuner on top of it, so users can fine-tune MusicGen with their own datasets. AudioCraft 1.2.0 is now implemented, which adds the stereo models.

MusicGen fine-tuning instruction blog post

Fine-tune MusicGen to generate music in any style

Model Architecture and Development

MusicGen is a single-stage auto-regressive Transformer model trained over a 32kHz EnCodec tokenizer with 4 codebooks sampled at 50 Hz. Unlike existing methods like MusicLM, MusicGen doesn't require a self-supervised semantic representation, and it generates all 4 codebooks in one pass. By introducing a small delay between the codebooks, the authors show they can predict them in parallel, thus having only 50 auto-regressive steps per second of audio. They used 20K hours of licensed music to train MusicGen, relying on an internal dataset of 10K high-quality music tracks and on ShutterStock and Pond5 music data.
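To make the delay pattern concrete, here is a small illustrative sketch (dummy token values, not real EnCodec output) of how shifting codebook k right by k steps lets all four codebooks be predicted in parallel at each auto-regressive step:

T, K, PAD = 6, 4, -1  # 6 frames, 4 codebooks, padding token

# tokens[k][t] = EnCodec token for codebook k at frame t (dummy values here)
tokens = [[100 * k + t for t in range(T)] for k in range(K)]

# Apply the delay: step s holds codebook k's token for frame s - k.
delayed = [
    [tokens[k][s - k] if 0 <= s - k < T else PAD for s in range(T + K - 1)]
    for k in range(K)
]

for row in delayed:
    print(row)
# Generation needs T + K - 1 auto-regressive steps instead of K * T,
# i.e. roughly 50 steps per second of audio at the 50 Hz frame rate.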

Prediction

Default Model

  • The default prediction model is configured as the melody model.
  • After completing the fine-tuning process from this repository, the trained model weights will be loaded into your own model repository.

Infinite Generation

  • You can set `duration` longer than 30 seconds.
  • Because MusicGen can generate at most 30 seconds of audio in one iteration, if the specified duration exceeds 30 seconds, the model will create multiple sequences. It uses the latter portion of the output from the previous generation step as the audio prompt (following the same continuation method) for the subsequent generation step.
  • Infinite generation works when 1) `input_audio=None`, 2) `input_audio` is given with `continuation=True`, or 3) `input_audio` longer than `duration` is used as melody-conditioning audio (that is, `continuation=False`). A conceptual sketch of this loop follows below.
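Here is a conceptual sketch of the chunked continuation loop described above. The 10-second overlap, the dummy generate_chunk helper, and the use of NumPy arrays are illustrative assumptions, not the repository's exact implementation:

import numpy as np

SR = 32000          # MusicGen's 32 kHz sample rate
CHUNK_S = 30        # maximum seconds generated per iteration
OVERLAP_S = 10      # tail of the previous chunk reused as the audio prompt

def generate_chunk(prompt_audio, seconds):
    # Stand-in for one <=30 s MusicGen call; returns the prompt (if any)
    # followed by `seconds` of newly generated audio.
    new = np.zeros(seconds * SR, dtype=np.float32)  # dummy audio
    return new if prompt_audio is None else np.concatenate([prompt_audio, new])

def generate_long(duration_s):
    audio = generate_chunk(None, min(duration_s, CHUNK_S))
    while audio.shape[0] < duration_s * SR:
        tail = audio[-OVERLAP_S * SR:]          # latter portion of the last output
        out = generate_chunk(tail, CHUNK_S)     # continuation step
        audio = np.concatenate([audio, out[tail.shape[0]:]])  # keep new part only
    return audio[: duration_s * SR]

print(generate_long(75).shape)  # (2400000,) == 75 s at 32 kHz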

Fine-tuning MusicGen

For instructions on fine-tuning MusicGen, please check the blog post: Fine-tune MusicGen to generate music in any style

Dataset

Audio

  • Compressed files in formats like .zip, .tar, .gz, and .tgz are compatible for dataset uploads.
  • Single audio files with .mp3, .wav, and .flac formats can also be uploaded.
  • Audio files within the dataset must exceed 30 seconds in duration.
  • Audio Chunking: Files surpassing 30 seconds will be divided into multiple 30-second chunks (see the sketch after this list).
  • Vocal Removal: If drop_vocals is set to True, the vocal tracks in the audio files will be isolated and removed. (Default: drop_vocals = True)
    • For datasets containing audio without vocals, setting drop_vocals = False reduces data preprocessing time and maintains audio file quality.
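As a rough illustration of the chunking step, here is a sketch using pydub. This is an assumption for illustration only; the fine-tuner's internal preprocessing may differ, for example in how it handles the short remainder at the end of a file:

from pydub import AudioSegment

CHUNK_MS = 30_000  # 30 seconds, in milliseconds

def chunk_track(path: str, out_prefix: str) -> None:
    seg = AudioSegment.from_file(path)
    for i, start in enumerate(range(0, len(seg), CHUNK_MS)):
        chunk = seg[start:start + CHUNK_MS]
        if len(chunk) == CHUNK_MS:  # assumption: drop the short final remainder
            chunk.export(f"{out_prefix}_{i:03d}.wav", format="wav")

# chunk_track("01_A_Man_Without_Love.mp3", "chunks/01_A_Man_Without_Love")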

Text Description

  • If each audio file requires a distinct description, create a .txt file with a single-line description corresponding to each .mp3 or .wav file (e.g. 01_A_Man_Without_Love.mp3 and 01_A_Man_Without_Love.txt); a helper sketch follows this list.
  • For a uniform description across all audio files, set the one_same_description argument to your desired description (str). In this case, there's no need for individual .txt files.
  • Auto Labeling: When auto_labeling is set to True, labels such as 'genre', 'mood', 'theme', 'instrumentation', 'key', and 'bpm' will be generated and added to each audio file in the dataset. (Default: auto_labeling = True)
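A small helper sketch for the per-file description layout (the file names and descriptions below are illustrative):

from pathlib import Path

descriptions = {  # illustrative entries; one single-line description per audio file
    "01_A_Man_Without_Love.mp3": "a 60s orchestral pop ballad with lush strings",
    "02_Funky_Groove.wav": "an upbeat funk groove with slap bass and horns",
}

for audio_name, text in descriptions.items():
    # Same basename as the audio file, .txt extension, single line of text.
    Path(audio_name).with_suffix(".txt").write_text(text + "\n")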

Train Parameters

Train Inputs

  • dataset_path: Path = Input(description="Path to dataset directory")
  • one_same_description: str = Input(description="A description for all of the audio data", default=None)
  • auto_labeling: bool = Input(description="Creating label data like genre, mood, theme, instrumentation, key, bpm for each track. Using essentia-tensorflow for music information retrieval.", default=True)
  • drop_vocals: bool = Input(description="Dropping the vocal tracks from the audio files in dataset, by separating sources with Demucs.", default=True)
  • model_version: str = Input(description="Model version to train.", default="stereo-melody", choices=["melody", "small", "medium", "stereo-melody", "stereo-small", "stereo-medium"])
  • lr: float = Input(description="Learning rate", default=1)
  • epochs: int = Input(description="Number of epochs to train for", default=3)
  • updates_per_epoch: int = Input(description="Number of iterations for one epoch", default=100). If None, the number of iterations per epoch is derived from the dataset and batch size; otherwise the given value is used.
  • batch_size: int = Input(description="Batch size", default=16)

Default Parameters

  • With epochs=3, updates_per_epoch=100 and lr=1, it takes around 15 minutes to fine-tune the model.
  • For 8-GPU multiprocessing, batch_size must be a multiple of 8. If not, batch_size will be automatically floored to the nearest multiple of 8 (see the sketch after this list).
  • For the medium model, the maximum batch_size is 8 on an 8 x Nvidia A40 machine.
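A sketch of the flooring rule; the lower bound of 8 is an assumption, since the docs only state that values are floored to the nearest multiple of 8:

def floored_batch_size(batch_size: int) -> int:
    # Behaviour for values below 8 is undocumented; this assumes a floor of 8.
    return max(8, batch_size // 8 * 8)

print(floored_batch_size(16))  # 16
print(floored_batch_size(12))  # 8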

Example Code

import replicate

training = replicate.trainings.create(
    version="sakemin/musicgen-fine-tuner:b1ec6490e57013463006e928abc7acd8d623fe3e8321d3092e1231bf006898b1",
    input={
        "dataset_path": "https://your/data/path.zip",
        "one_same_description": "description for your dataset music",
        "epochs": 3,
        "updates_per_epoch": 100,
        "model_version": "medium",
    },
    destination="my-name/my-model"
)

print(training)
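Trainings run asynchronously, so you can poll for completion; replicate.trainings.get() is part of the replicate Python client, and the sleep interval below is arbitrary:

import time

while training.status not in ("succeeded", "failed", "canceled"):
    time.sleep(30)
    training = replicate.trainings.get(training.id)

print(training.status)
# Once succeeded, the fine-tuned weights are pushed to the `destination`
# model ("my-name/my-model" above), which you can then run predictions against.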
