Text-to-audio generation with latent diffusion models

Run time and cost

Predictions run on Nvidia T4 GPU hardware and typically complete within 56 seconds, although the predict time varies significantly with the inputs.

Model description

AudioLDM generates text-conditional sound effects, human speech, and music. It enables zero-shot text-guided audio style-transfer, inpainting, and super-resolution.

GitHub Demos and Project Page
GitHub Repo for code

Tricks for Enhancing the Quality of Your Generated Audio

  1. Use more adjectives to describe the sound. For example, "A man is speaking clearly and slowly in a large room" works better than "A man is speaking". The extra detail helps AudioLDM understand what you want.
  2. Try different random seeds. The seed can noticeably change the quality of the generated audio.
  3. Prefer general terms such as 'man' or 'woman' over the names of specific individuals, and avoid abstract objects that people may not be familiar with.
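
The first two tips can be combined into a small seed sweep: keep one descriptive prompt fixed and generate once per seed, then listen and pick the best result. The sketch below is only an illustration of that workflow. The names seed_sweep and fake_generate are hypothetical and not part of AudioLDM, and fake_generate is a placeholder that you would swap for whatever text-to-audio call you actually use.

```python
from typing import Callable, Dict, Iterable


def seed_sweep(
    prompt: str,
    seeds: Iterable[int],
    generate: Callable[[str, int], object],
) -> Dict[int, object]:
    """Run the same prompt once per seed and collect the results by seed."""
    return {seed: generate(prompt, seed) for seed in seeds}


def fake_generate(prompt: str, seed: int) -> str:
    # Hypothetical placeholder standing in for a real text-to-audio call.
    # It only returns a label so the sketch runs without a model or GPU.
    return f"audio(seed={seed}, prompt={prompt!r})"


# A descriptive prompt (tip 1) tried with several seeds (tip 2).
prompt = "A man is speaking clearly and slowly in a large room"
results = seed_sweep(prompt, [0, 1, 2], fake_generate)

for seed, clip in results.items():
    print(seed, clip)
```

Keeping the results keyed by seed makes it easy to regenerate the clip you liked best with the same prompt and seed.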

Model Authors

Haohe Liu, Zehua Chen, Yi Yuan, Xinhao Mei, Xubo Liu, Danilo Mandic, Wenwu Wang, Mark D. Plumbley