daanelson/attend-and-excite

Attention-Based Semantic Guidance for Text-to-Image Diffusion Models


## Run time and cost

Predictions run on Nvidia A100 (40GB) GPU hardware and typically complete within 34 seconds, though predict time varies significantly with the inputs.
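To call the hosted model programmatically, the sketch below uses Replicate's Python client. The input key (`prompt`) is an assumption rather than this model's documented schema, so consult the model's API documentation for the exact input names.

```python
import replicate

# Minimal sketch of calling this model through the Replicate Python client.
# Requires the REPLICATE_API_TOKEN environment variable to be set.
# NOTE: the "prompt" input key below is an assumed parameter name, not this
# model's documented schema.
output = replicate.run(
    "daanelson/attend-and-excite",
    input={"prompt": "a horse and a dog"},
)
print(output)  # typically a URL (or list of URLs) to the generated image(s)
```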

# Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models

Given a pre-trained text-to-image diffusion model (e.g., Stable Diffusion), our method, Attend-and-Excite, guides the model to modify its cross-attention values during image synthesis so that the generated images more faithfully depict the input text prompt. Stable Diffusion alone (top row) struggles to generate multiple objects (e.g., a horse and a dog). By incorporating Attend-and-Excite (bottom row) to strengthen the subject tokens (marked in blue), we obtain images that are more semantically faithful to the input text prompts.
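To see the method in code, the sketch below uses the `StableDiffusionAttendAndExcitePipeline` that the diffusers library ships for this method. The base model id, token indices, and hyperparameters are illustrative choices, not necessarily this Replicate model's exact configuration.

```python
import torch
from diffusers import StableDiffusionAttendAndExcitePipeline

# Load the Attend-and-Excite pipeline shipped with diffusers
# (assumes diffusers >= 0.16 and a CUDA-capable GPU).
pipe = StableDiffusionAttendAndExcitePipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

prompt = "a horse and a dog"

# get_indices maps token positions to tokens, so you can pick the subject
# tokens ("horse", "dog") whose cross-attention will be strengthened.
print(pipe.get_indices(prompt))
# e.g. {0: '<|startoftext|>', 1: 'a</w>', 2: 'horse</w>', 3: 'and</w>',
#       4: 'a</w>', 5: 'dog</w>', 6: '<|endoftext|>'}

image = pipe(
    prompt=prompt,
    token_indices=[2, 5],   # strengthen "horse" and "dog"
    guidance_scale=7.5,
    num_inference_steps=50,
    max_iter_to_alter=25,   # latent updates applied during the first 25 steps
    generator=torch.Generator("cuda").manual_seed(0),
).images[0]
image.save("horse_and_dog.png")
```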

A demo is also available on Hugging Face Spaces.

## Acknowledgements

This code builds on the [diffusers](https://github.com/huggingface/diffusers) library as well as the [Prompt-to-Prompt](https://github.com/google/prompt-to-prompt/) codebase.