stability-ai / stable-diffusion-img2img

Generate a new image from an input image with Stable Diffusion

Run time and cost

This model runs on Nvidia A100 (40GB) GPU hardware. Predictions typically complete within 11 seconds.
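
For reference, here is a minimal sketch of running a prediction through the Replicate Python client. The input field names (`image`, `prompt`, `prompt_strength`) are assumptions and should be checked against this model's API schema.

```python
# Hedged sketch using the Replicate Python client (pip install replicate).
# Requires REPLICATE_API_TOKEN in the environment. The input field names
# below are assumptions -- check the model's API schema for exact parameters.
import replicate

output = replicate.run(
    "stability-ai/stable-diffusion-img2img",
    input={
        "image": open("input.png", "rb"),   # initial image to transform
        "prompt": "a fantasy landscape, trending on artstation",
        "prompt_strength": 0.8,             # how far to move away from the input
    },
)
print(output)  # typically a list of URLs to the generated image(s)
```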

Readme

Stable Diffusion can be used for text-guided image-to-image translation via the diffusion-denoising mechanism first proposed by SDEdit. This model uses the Stable Diffusion weights to generate new images from an input image and a text prompt, using the StableDiffusionImg2ImgPipeline from diffusers.
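
The same pipeline can be run locally with diffusers. Below is a minimal sketch, assuming the stabilityai/stable-diffusion-2-1 checkpoint (the exact checkpoint this deployment pins is not stated here) and a CUDA-capable GPU; argument names follow current diffusers releases.

```python
# Minimal img2img sketch with diffusers. The checkpoint name is an assumption;
# this deployment may use a different Stable Diffusion version.
# Requires: pip install diffusers transformers accelerate torch pillow
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

# The input image is encoded to latents, partially noised, then denoised
# under text guidance (the SDEdit mechanism described above).
init_image = Image.open("input.png").convert("RGB").resize((768, 768))

result = pipe(
    prompt="a fantasy landscape, trending on artstation",
    image=init_image,
    strength=0.75,            # 0 keeps the input unchanged, 1 ignores it entirely
    guidance_scale=7.5,
    num_inference_steps=50,
).images[0]
result.save("output.png")
```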

  • Developed by: Robin Rombach, Patrick Esser
  • Model type: Diffusion-based text-to-image generation model
  • Language(s): English
  • License: CreativeML Open RAIL++-M License
  • Model Description: This is a model that can be used to modify images based on a text prompt and an initial image. It is a Latent Diffusion Model that uses a fixed, pretrained text encoder (OpenCLIP-ViT/H).
  • Resources for more information: GitHub Repository.

Intended use

See stability-ai/stable-diffusion for direct use, misuse, malicious use, out-of-scope use, limitations, and bias.

Training

Training Data

The model developers used the following dataset for training the model:

  • LAION-5B and subsets (details below). The training data is further filtered using LAION’s NSFW detector, with a “p_unsafe” score of 0.1 (conservative). For more details, please refer to LAION-5B’s NeurIPS 2022 paper and reviewer discussions on the topic.

Training Procedure

Stable Diffusion v2 is a latent diffusion model which combines an autoencoder with a diffusion model that is trained in the latent space of the autoencoder. During training:

  • Images are encoded through an encoder, which turns images into latent representations. The autoencoder uses a relative downsampling factor of f = 8 and maps images of shape H x W x 3 to latents of shape H/f x W/f x 4.
  • Text prompts are encoded through the OpenCLIP-ViT/H text-encoder.
  • The output of the text encoder is fed into the UNet backbone of the latent diffusion model via cross-attention.
  • The loss is a reconstruction objective between the noise that was added to the latent and the prediction made by the UNet. We also use the so-called v-objective (https://arxiv.org/abs/2202.00512), sketched below.
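
For concreteness, here is a sketch of the v-prediction objective in the notation of that paper; the symbols z_t, alpha_t, sigma_t, x_0, epsilon, and the conditioning c are introduced here for illustration and are not defined elsewhere in this readme.

```latex
% Sketch of the v-prediction objective (Salimans & Ho, arXiv:2202.00512).
% x_0: clean latent, epsilon: sampled Gaussian noise, alpha_t / sigma_t:
% noise schedule at timestep t, c: text-encoder conditioning (cross-attention).
\begin{align*}
  z_t &= \alpha_t x_0 + \sigma_t \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I) \\
  v_t &= \alpha_t \epsilon - \sigma_t x_0 \\
  \mathcal{L} &= \mathbb{E}_{x_0,\, \epsilon,\, t}\big[\, \lVert v_\theta(z_t, t, c) - v_t \rVert_2^2 \,\big]
\end{align*}
```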