cjwbw / demucs

Demucs Music Source Separation

  • Public
  • 142.9K runs
  • GitHub
  • Paper
  • License

Run time and cost

This model costs approximately $0.029 to run on Replicate, or 34 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.

This model runs on Nvidia T4 GPU hardware. Predictions typically complete within 131 seconds. The predict time for this model varies significantly based on the inputs.
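Since the model is hosted on Replicate, one way to run it programmatically is through the Replicate Python client. The sketch below is a hedged example: the version identifier and the input field name are placeholders/assumptions, so check the model's API tab for the exact schema.

```python
# Hedged sketch of running this model via the Replicate Python client.
# The version id and the "audio" input field name are assumptions; consult
# the model's API documentation on Replicate for the real schema.
import replicate

output = replicate.run(
    "cjwbw/demucs:<version-id>",              # placeholder version id (assumption)
    input={"audio": open("song.mp3", "rb")},  # "audio" field name is an assumption
)
print(output)
```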

Readme

Demucs Music Source Separation

Demucs is a state-of-the-art music source separation model, currently capable of separating drums, bass, and vocals from the rest of the accompaniment. Demucs is based on a U-Net convolutional architecture inspired by [Wave-U-Net][waveunet]. The v4 version features [Hybrid Transformer Demucs][htdemucs], a hybrid spectrogram/waveform separation model using Transformers. It is based on [Hybrid Demucs][hybrid_paper] (also provided in this repo), with the innermost layers replaced by a cross-domain Transformer encoder. This Transformer uses self-attention within each domain, and cross-attention across domains. The model achieves an SDR of 9.00 dB on the MUSDB HQ test set. Moreover, when using sparse attention kernels to extend its receptive field and per-source fine-tuning, we achieve a state-of-the-art SDR of 9.20 dB.
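To make the cross-domain Transformer idea concrete, here is a minimal PyTorch sketch of a single layer that applies self-attention within each domain (temporal and spectral token sequences) and cross-attention across domains. It is an illustration under assumed dimensions, not the actual Hybrid Transformer Demucs implementation.

```python
# Illustrative sketch only (not the actual Demucs code): one cross-domain
# Transformer layer with self-attention within each domain and cross-attention
# across domains. The width, head count, and absence of feed-forward or
# normalization blocks are simplifying assumptions.
import torch
import torch.nn as nn


class CrossDomainLayer(nn.Module):
    def __init__(self, dim: int = 384, heads: int = 8):
        super().__init__()
        self.self_t = nn.MultiheadAttention(dim, heads, batch_first=True)   # temporal self-attention
        self.self_s = nn.MultiheadAttention(dim, heads, batch_first=True)   # spectral self-attention
        self.cross_t = nn.MultiheadAttention(dim, heads, batch_first=True)  # temporal queries spectral
        self.cross_s = nn.MultiheadAttention(dim, heads, batch_first=True)  # spectral queries temporal

    def forward(self, xt: torch.Tensor, xs: torch.Tensor):
        # Self-attention within each domain (with residual connections).
        xt = xt + self.self_t(xt, xt, xt)[0]
        xs = xs + self.self_s(xs, xs, xs)[0]
        # Cross-attention across domains: each branch attends to the other.
        xt_out = xt + self.cross_t(xt, xs, xs)[0]
        xs_out = xs + self.cross_s(xs, xt, xt)[0]
        return xt_out, xs_out


# Toy usage: batch of 2, 100 temporal tokens and 120 spectral tokens, width 384.
xt, xs = torch.randn(2, 100, 384), torch.randn(2, 120, 384)
xt, xs = CrossDomainLayer()(xt, xs)
```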

Samples are available on our sample page. Check out [our paper][htdemucs] for more information. It has been trained on the [MUSDB HQ][musdb] dataset plus an extra training dataset of 800 songs. This model separates drums, bass, vocals, and other stems for any song.

As Hybrid Transformer Demucs is brand new, it is not activated by default; you can activate it in the usual commands described hereafter with -n htdemucs_ft. The single, non-fine-tuned model is provided as -n htdemucs, and the retrained baseline as -n hdemucs_mmi. The Sparse Hybrid Transformer model described in our paper is not provided, as it requires custom CUDA code that is not ready for release yet. We are also releasing an experimental 6-source model that adds guitar and piano sources. Quick testing suggests okay quality for guitar, but a lot of bleeding and artifacts for the piano source. (A usage sketch follows the model list below.)

The list of pre-trained models is:
- htdemucs: first version of Hybrid Transformer Demucs. Trained on MusDB + 800 songs. Default model.
- htdemucs_ft: fine-tuned version of htdemucs; separation takes about 4 times longer but may be slightly better. Same training set as htdemucs.
- htdemucs_6s: 6-source version of htdemucs, with piano and guitar added as sources. Note that the piano source does not perform well at the moment.
- hdemucs_mmi: Hybrid Demucs v3, retrained on MusDB + 800 songs.
- mdx: trained only on MusDB HQ, winning model on track A at the [MDX][mdx] challenge.
- mdx_extra: trained with extra training data (including MusDB test set), ranked 2nd on the track B of the [MDX][mdx] challenge.
- mdx_q, mdx_extra_q: quantized version of the previous models. Smaller download and storage but quality can be slightly worse.
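As a quick illustration of selecting a model with the -n flag, here is a small Python wrapper around the demucs command line. It assumes the demucs package is installed and the `demucs` command is on PATH; "song.mp3" is a placeholder input file.

```python
# Minimal sketch: selecting a pretrained model via the -n flag of the demucs CLI.
# Assumes `pip install demucs` has been run and `demucs` is on PATH;
# "song.mp3" is a placeholder input file.
import subprocess

# Default Hybrid Transformer model:
subprocess.run(["demucs", "-n", "htdemucs", "song.mp3"], check=True)

# Fine-tuned variant (roughly 4x slower, possibly slightly better):
subprocess.run(["demucs", "-n", "htdemucs_ft", "song.mp3"], check=True)

# Experimental 6-source model with guitar and piano stems:
subprocess.run(["demucs", "-n", "htdemucs_6s", "song.mp3"], check=True)
```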

Figure: schema of Hybrid Transformer Demucs, with a dual U-Net structure (one branch for the temporal domain, one for the spectral domain) and a cross-domain Transformer between the encoders and decoders.

How to cite

@inproceedings{rouard2022hybrid,
  title={Hybrid Transformers for Music Source Separation},
  author={Rouard, Simon and Massa, Francisco and D{\'e}fossez, Alexandre},
  booktitle={ICASSP 23},
  year={2023}
}

@inproceedings{defossez2021hybrid,
  title={Hybrid Spectrogram and Waveform Source Separation},
  author={D{\'e}fossez, Alexandre},
  booktitle={Proceedings of the ISMIR 2021 Workshop on Music Source Separation},
  year={2021}
}