stability-ai / stable-diffusion-img2img

Generate a new image from an input image with Stable Diffusion

  • Public
  • 981.8K runs
  • A100 (80GB)
  • GitHub
  • License

Input

  • prompt (string): Input prompt. Default: "A fantasy landscape, trending on artstation"
  • negative_prompt (string): The prompt NOT to guide the image generation. Ignored when not using guidance.
  • image (file): Initial image to generate variations of.
  • prompt_strength (number): Prompt strength when providing the image. 1.0 corresponds to full destruction of information in the init image. Default: 0.8
  • num_outputs (integer, minimum: 1, maximum: 8): Number of images to output. Higher number of outputs may OOM. Default: 1
  • num_inference_steps (integer, minimum: 1, maximum: 500): Number of denoising steps. Default: 25
  • guidance_scale (number, minimum: 1, maximum: 20): Scale for classifier-free guidance. Default: 7.5
  • scheduler (string): Choose a scheduler. Default: "DPMSolverMultistep"
  • seed (integer): Random seed. Leave blank to randomize the seed.
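
As a quick illustration, here is a minimal sketch of calling the model with the Replicate Python client. The parameter names mirror the input list above, and "init.png" is a placeholder path; adjust both to your own setup.

    import replicate

    # Minimal sketch: run the model with the inputs listed above.
    output = replicate.run(
        "stability-ai/stable-diffusion-img2img",
        input={
            "image": open("init.png", "rb"),   # initial image to generate variations of
            "prompt": "A fantasy landscape, trending on artstation",
            "prompt_strength": 0.8,            # 1.0 fully destroys the init-image information
            "num_outputs": 1,
            "num_inference_steps": 25,
            "guidance_scale": 7.5,
            "scheduler": "DPMSolverMultistep",
        },
    )
    print(output)  # prints the generated output (typically a list of image URLs)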

Run time and cost

This model costs approximately $0.0093 to run on Replicate, or 107 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.

This model runs on Nvidia A100 (80GB) GPU hardware. Predictions typically complete within 7 seconds. The predict time for this model varies significantly based on the inputs.

Readme

Stable Diffusion uses a diffusion-denoising mechanism, as first proposed by SDEdit, for text-guided image-to-image translation. This model uses the weights from Stable Diffusion to generate new images from an input image via the StableDiffusionImg2ImgPipeline from diffusers.
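
To reproduce this behaviour locally with diffusers, a minimal sketch might look like the following. The checkpoint id "stabilityai/stable-diffusion-2-1" is an assumption; the deployment may pin a different Stable Diffusion checkpoint, and the file names are placeholders.

    import torch
    from PIL import Image
    from diffusers import StableDiffusionImg2ImgPipeline

    # Load an SD v2-style checkpoint; the exact checkpoint used by this
    # deployment is not stated here, so this id is an assumption.
    pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
        "stabilityai/stable-diffusion-2-1",
        torch_dtype=torch.float16,
    ).to("cuda")

    init_image = Image.open("init.png").convert("RGB").resize((768, 768))
    result = pipe(
        prompt="A fantasy landscape, trending on artstation",
        image=init_image,
        strength=0.8,              # corresponds to prompt_strength above
        num_inference_steps=25,
        guidance_scale=7.5,
    )
    result.images[0].save("output.png")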

  • Developed by: Robin Rombach, Patrick Esser
  • Model type: Diffusion-based text-to-image generation model
  • Language(s): English
  • License: CreativeML Open RAIL++-M License
  • Model Description: This is a model that can be used to modify images based on a text prompt and an initial image. It is a Latent Diffusion Model that uses a fixed, pretrained text encoder (OpenCLIP-ViT/H).
  • Resources for more information: GitHub Repository.

Intended use

See stability-ai/stable-diffusion for direct use, misuse, malicious use, out-of-scope use, limitations, and bias.

Training

Training Data

The model developers used the following dataset for training the model:

  • LAION-5B and subsets (details below). The training data is further filtered using LAION’s NSFW detector, with a “p_unsafe” score of 0.1 (conservative). For more details, please refer to LAION-5B’s NeurIPS 2022 paper and reviewer discussions on the topic.

Training Procedure

Stable Diffusion v2 is a latent diffusion model which combines an autoencoder with a diffusion model that is trained in the latent space of the autoencoder. During training:

  • Images are encoded through an encoder, which turns images into latent representations. The autoencoder uses a relative downsampling factor of 8 and maps images of shape H x W x 3 to latents of shape H/f x W/f x 4 (for example, a 512 x 512 x 3 image becomes a 64 x 64 x 4 latent).
  • Text prompts are encoded through the OpenCLIP-ViT/H text-encoder.
  • The output of the text encoder is fed into the UNet backbone of the latent diffusion model via cross-attention.
  • The loss is a reconstruction objective between the noise that was added to the latent and the prediction made by the UNet. We also use the so-called v-objective, see https://arxiv.org/abs/2202.00512.
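
As a sketch of the v-objective referenced in the last bullet, following the notation of https://arxiv.org/abs/2202.00512 (the exact loss weighting used in training is not specified here):

    z_t = \alpha_t x_0 + \sigma_t \epsilon
    v_t \equiv \alpha_t \epsilon - \sigma_t x_0
    L = \mathbb{E}_{x_0, c, \epsilon, t}\left[ \lVert v_\theta(z_t, t, c) - v_t \rVert_2^2 \right]

where x_0 is the latent of the training image, \epsilon is the sampled noise, c is the text conditioning fed in via cross-attention, and v_\theta is the UNet's prediction.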