pix2pix-zero
This is the author’s reimplementation of “Zero-shot Image-to-Image Translation” using the diffusers library.
The results in the paper are based on the CompVis library; that version of the code will be released later.
TL;DR: no finetuning required, no text input needed, input structure preserved.
Results
All our results are based on the stable-diffusion-v1-4 model. Please see the website for more results.
Method Details
Given an input image, we first generate text captions using BLIP and apply regularized DDIM inversion to obtain our inverted noise map. Then, we obtain reference cross-attention maps that correspond to the structure of the input image by denoising, guided by the CLIP embeddings of our generated text (c). Next, we denoise with the edited text embeddings, while enforcing a loss that matches the current cross-attention maps to the reference cross-attention maps.
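The last step, matching the current cross-attention maps to the reference ones, can be illustrated with a toy sketch. This is not the code in this repository: the scalar "latent", the one-number-per-token keys, the squared-error loss form, and the function names (`cross_attention`, `attention_loss`, `guidance_step`) are all assumptions made for illustration. A real implementation would read the attention maps from the UNet and use autograd rather than finite differences.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(latent, token_keys):
    # Toy stand-in for a UNet cross-attention layer (assumption):
    # each latent position attends over the text tokens.
    return [softmax([x * k for k in token_keys]) for x in latent]

def attention_loss(current, reference):
    # Squared difference between current and reference attention maps.
    return sum(
        (c - r) ** 2
        for row_c, row_r in zip(current, reference)
        for c, r in zip(row_c, row_r)
    )

def guidance_step(latent, edited_keys, reference_maps, step_size=0.1, eps=1e-5):
    # One gradient step on the latent that pulls the edited attention maps
    # toward the reference maps. Central finite differences stand in for
    # autograd here (assumption, for a dependency-free sketch).
    def loss_at(z):
        return attention_loss(cross_attention(z, edited_keys), reference_maps)

    grad = []
    for i in range(len(latent)):
        plus, minus = latent[:], latent[:]
        plus[i] += eps
        minus[i] -= eps
        grad.append((loss_at(plus) - loss_at(minus)) / (2 * eps))
    return [x - step_size * g for x, g in zip(latent, grad)]

# Hypothetical toy values: three latent positions, three text tokens.
latent = [0.5, -1.0, 2.0]
source_keys = [1.0, 0.0, -1.0]   # stands in for the original text embedding
edited_keys = [1.0, 0.5, -1.0]   # stands in for the edited text embedding

# Reference maps come from denoising with the original text.
reference_maps = cross_attention(latent, source_keys)

before = attention_loss(cross_attention(latent, edited_keys), reference_maps)
new_latent = guidance_step(latent, edited_keys, reference_maps)
after = attention_loss(cross_attention(new_latent, edited_keys), reference_maps)
```

In this sketch the text embedding stays at its edited value and only the latent is updated, so the edit direction is kept while the attention layout is pulled back toward that of the input image.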