chenxwh / pix2pix-zero

Zero-shot Image-to-Image Translation

Run time and cost

This model runs on Nvidia A100 (40GB) GPU hardware. Predictions typically complete within 4 minutes. The predict time for this model varies significantly based on the inputs.



This is the authors’ reimplementation of “Zero-shot Image-to-Image Translation” using the diffusers library.
The results in the paper are based on the CompVis library; that implementation will be released later.

TL;DR: no finetuning required, no text input needed, input structure preserved.


All our results are based on the stable-diffusion-v1-4 model. Please see the website for more results.

Method Details

Given an input image, we first generate text captions using BLIP and apply regularized DDIM inversion to obtain our inverted noise map. Then, we obtain reference cross-attention maps that correspond to the structure of the input image by denoising, guided with the CLIP embeddings of our generated text (c). Next, we denoise with edited text embeddings, while enforcing a loss that matches the current cross-attention maps to the reference cross-attention maps.
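
The attention-matching step can be illustrated with a small toy sketch. This is not the repository's code: it is a minimal pure-Python illustration, assuming a squared-error loss between attention maps and a plain gradient step on the latent. The one-dimensional "latent", the toy `cross_attention` function, the `edit_offset` values, and the finite-difference gradient are all hypothetical stand-ins for the UNet attention layers and autograd used in a real diffusion pipeline.

```python
import math


def cross_attention(latent, tokens):
    """Toy cross-attention: for each latent 'pixel', a softmax over text tokens."""
    maps = []
    for x in latent:
        logits = [x * t for t in tokens]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in logits]
        total = sum(exps)
        maps.append([e / total for e in exps])
    return maps


def attention_loss(latent, tokens, reference_maps):
    """Squared error between the current attention maps and the reference maps."""
    current = cross_attention(latent, tokens)
    return sum(
        (a - b) ** 2
        for cur_row, ref_row in zip(current, reference_maps)
        for a, b in zip(cur_row, ref_row)
    )


def guidance_step(latent, tokens, reference_maps, step_size=0.2, eps=1e-5):
    """One gradient step on the latent (central finite differences, toy only)."""
    grad = []
    for i in range(len(latent)):
        plus = latent[:i] + [latent[i] + eps] + latent[i + 1:]
        minus = latent[:i] + [latent[i] - eps] + latent[i + 1:]
        diff = (attention_loss(plus, tokens, reference_maps)
                - attention_loss(minus, tokens, reference_maps))
        grad.append(diff / (2 * eps))
    return [x - step_size * g for x, g in zip(latent, grad)]


# Hypothetical toy values: 3 latent "pixels" and 3 text tokens.
latent = [0.5, -1.0, 2.0]
text_embedding = [1.0, 0.0, -0.5]   # stands in for c
edit_offset = [0.0, 0.5, 0.0]       # assumed way of forming the edited embedding
edited_embedding = [c + d for c, d in zip(text_embedding, edit_offset)]

# Reference maps come from the original text embedding.
reference_maps = cross_attention(latent, text_embedding)

# With the edited embedding, nudge the latent so its maps stay near the reference.
loss_before = attention_loss(latent, edited_embedding, reference_maps)
guided = list(latent)
for _ in range(50):
    guided = guidance_step(guided, edited_embedding, reference_maps)
loss_after = attention_loss(guided, edited_embedding, reference_maps)

print("loss before guidance:", loss_before)
print("loss after guidance: ", loss_after)
```

In this toy, the guided latent yields attention maps closer to the reference than the unguided one. In an actual sampler, a correction of this kind would be applied to the latent at each denoising step before the noise prediction with the edited embedding, with gradients computed by automatic differentiation rather than finite differences.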