chenxwh / audiosep

Separate Anything You Describe

  • Public
  • 2.4K runs



Run time and cost

This model runs on NVIDIA A40 (Large) GPU hardware. Predictions typically complete within 4 seconds, but prediction time varies significantly with the inputs.


Separate Anything You Describe

We introduce AudioSep, a foundation model for open-domain sound separation with natural language queries. AudioSep demonstrates strong separation performance and impressive zero-shot generalization ability on numerous tasks, such as audio event separation, musical instrument separation, and speech enhancement. Check out the separated audio examples on the Demo Page!

Cite this work

If you found this tool useful, please consider citing:

@article{liu2023separate,
  title={Separate Anything You Describe},
  author={Liu, Xubo and Kong, Qiuqiang and Zhao, Yan and Liu, Haohe and Yuan, Yi and Liu, Yuzhuo and Xia, Rui and Wang, Yuxuan and Plumbley, Mark D and Wang, Wenwu},
  journal={arXiv preprint arXiv:2308.05037},
  year={2023}
}

@inproceedings{liu2022separate,
  title={Separate What You Describe: Language-Queried Audio Source Separation},
  author={Liu, Xubo and Liu, Haohe and Kong, Qiuqiang and Mei, Xinhao and Zhao, Jinzheng and Huang, Qiushi and Plumbley, Mark D and Wang, Wenwu},
  booktitle={Proc. Interspeech},
  year={2022}
}