Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation

[Project Page]

The model

This model lets you edit images with text by performing text-guided image-to-image translation. You can either provide your own image, or use a separate text prompt to generate an initial image with Stable Diffusion and then translate it using the translation prompt.

To translate your own image, set the input_image argument and leave generation_prompt empty.
To first generate an image from text, leave input_image empty and put your text prompt in generation_prompt. In this case, the generated input image is returned as the first output.
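
The choice between the two input modes can be sketched as a small check. This is only an illustration: the helper name resolve_input_mode and the returned mode strings are hypothetical, and raising an error when both or neither argument is set is an assumption, since the text above does not say how the model handles those cases.

```python
def resolve_input_mode(input_image=None, generation_prompt=""):
    """Decide which feature-extraction path applies (hypothetical helper)."""
    has_image = input_image is not None and input_image != ""
    has_prompt = bool(generation_prompt and generation_prompt.strip())

    # Assumption: exactly one of the two inputs should be provided.
    if has_image and has_prompt:
        raise ValueError("Set either input_image or generation_prompt, not both.")
    if not has_image and not has_prompt:
        raise ValueError("Provide input_image or generation_prompt.")

    return "invert_image" if has_image else "generate_from_text"


mode_a = resolve_input_mode(input_image="horse.png")
mode_b = resolve_input_mode(generation_prompt="a photo of a horse in a field")
```

Here mode_a selects the inversion path and mode_b selects the text-to-image path.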

1. Feature extraction

From input_image: The input image is first inverted, producing a noise map that Stable Diffusion can transform back into the original image. The intermediate Stable Diffusion features of this reconstruction are saved.
From text: Stable Diffusion generates an image from the text prompt, and the intermediate features of this generation are saved.
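
In both paths, the saving step can be pictured as a cache keyed by sampling step and layer. The following is a toy sketch in plain Python with no diffusion model involved; the FeatureCache class and the layer name "decoder_block_4" are hypothetical and are not taken from the model's code.

```python
from collections import defaultdict


class FeatureCache:
    """Toy store for intermediate features, keyed by sampling step and layer."""

    def __init__(self):
        self._store = defaultdict(dict)

    def save(self, step, layer, features):
        self._store[step][layer] = features

    def load(self, step, layer):
        return self._store[step][layer]

    def steps(self):
        return sorted(self._store)


cache = FeatureCache()
for step in range(3):
    # Stand-in values; a real run would store feature tensors here.
    cache.save(step, "decoder_block_4", [step, step * 2])
```

During translation, the features saved for a given step and layer would be looked up with cache.load(step, layer).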

2. Image translation

A new translated image is generated using the translation text and the saved spatial features. In the config parameters, you can control the following aspects of the translation:

Structure preservation can be controlled by the feature_injection_threshold parameter. A higher value preserves structure better but can also leak details from the source image; a value of about 80% of the total sampling steps generally gives a good tradeoff.
Deviation from the guidance image can be controlled through the scale, negative_prompt_alpha and negative_prompt_schedule parameters (see the sample config files for details). The effect of negative prompting is minor for realistic guidance images, but it can help significantly for minimalistic and abstract guidance images (e.g. segmentations).
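
The 80% rule of thumb for feature_injection_threshold can be worked through as a small calculation. This is a minimal sketch under two assumptions that the text does not confirm: that the threshold is counted in sampling steps, and that injection applies to the earliest steps (index 0 being the first, noisiest step). The helper names are hypothetical.

```python
def injection_threshold(num_steps, fraction=0.8):
    """Number of sampling steps that receive injected features."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    return int(round(num_steps * fraction))


def should_inject(step_index, threshold):
    """Assumption: the first `threshold` steps use the saved features."""
    return step_index < threshold


num_steps = 50
threshold = injection_threshold(num_steps)
injected = [i for i in range(num_steps) if should_inject(i, threshold)]
```

With 50 sampling steps, the threshold is 40, so under these assumptions the last 10 steps run without injected features.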

Note that you can run a batch of translations by providing multiple target prompts in the prompts parameter.
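
A batch request might be laid out as follows. This is a sketch only: the parameter names prompts, feature_injection_threshold and scale come from the text above, but the values are illustrative, the expand_batch helper is hypothetical, and the way the model actually iterates over prompts is an assumption. negative_prompt_alpha and negative_prompt_schedule are left out because their valid values are documented in the sample config files.

```python
config = {
    "prompts": [
        "a photo of a golden robot",
        "a watercolor painting of a robot",
    ],
    "feature_injection_threshold": 40,  # illustrative value
    "scale": 7.5,                       # illustrative value
}


def expand_batch(config):
    """Pair each target prompt with the shared settings (hypothetical helper)."""
    shared = {k: v for k, v in config.items() if k != "prompts"}
    return [{"prompt": p, **shared} for p in config["prompts"]]


jobs = expand_batch(config)
```

Each entry in jobs describes one translation of the same source features with a different target prompt.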