Text-to-audio with latent diffusion
Model description
AudioLDM is a latent diffusion model that generates sound effects, human speech, and music from text prompts. It also enables zero-shot text-guided audio style transfer, inpainting, and super-resolution.
Demos and project page available; see the GitHub repo for code.
Tricks for Enhancing the Quality of Your Generated Audio
- Try to use more adjectives to describe your sound. For example: “A man is speaking clearly and slowly in a large room” is better than “A man is speaking”. This can help ensure AudioLDM understands what you want.
- Try using different random seeds, which can sometimes affect the generation quality.
- It’s better to use general terms like ‘man’ or ‘woman’ instead of specific names of individuals, or abstract concepts that the model may not be familiar with.
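The tips above apply at generation time. As a minimal sketch, here is how they might be used with the `AudioLDMPipeline` from Hugging Face `diffusers`; the checkpoint name and all parameter values below are illustrative assumptions, not specified by this card:

```python
# Sketch: text-to-audio generation with AudioLDM via diffusers.
# Checkpoint name and parameters are assumptions for illustration.
import torch
import scipy.io.wavfile
from diffusers import AudioLDMPipeline

pipe = AudioLDMPipeline.from_pretrained(
    "cvssp/audioldm-s-full-v2", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

# A descriptive prompt with adjectives, per the tips above.
prompt = "A man is speaking clearly and slowly in a large room"

# Fixing the seed makes a generation reproducible; trying different
# seeds can change the quality of the result.
generator = torch.Generator("cuda").manual_seed(0)

audio = pipe(
    prompt,
    num_inference_steps=50,
    audio_length_in_s=5.0,
    generator=generator,
).audios[0]

# AudioLDM produces 16 kHz mono audio.
scipy.io.wavfile.write("output.wav", rate=16000, data=audio)
```

Re-running with a different `manual_seed` value is the easiest way to explore the seed-dependent quality mentioned above.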
Model Authors
Haohe Liu, Zehua Chen, Yi Yuan, Xinhao Mei, Xubo Liu, Danilo Mandic, Wenwu Wang, Mark D. Plumbley