Stable Diffusion XL trained specifically for inpainting by Hugging Face
Get the image embeddings from Segment Anything
Generate sounds from a text prompt