stability-ai / stable-diffusion-inpainting

Fill in masked parts of images with Stable Diffusion

  • Public
  • 19.5M runs
  • A100 (80GB)
  • GitHub
  • License

Input

  • string: Input prompt. Default: "a vision of paradise. unreal engine"
  • image (file): Initial image to generate variations of. Will be resized to height x width.
  • mask (file): Black and white image to use as a mask for inpainting over the provided image. White pixels are inpainted and black pixels are preserved.
  • integer: Height of the generated image in pixels. Must be a multiple of 64. Default: 512
  • integer: Width of the generated image in pixels. Must be a multiple of 64. Default: 512
  • string: Specify things not to see in the output.
  • integer (minimum: 1, maximum: 4): Number of images to generate. Default: 1
  • integer (minimum: 1, maximum: 500): Number of denoising steps. Default: 50
  • number (minimum: 1, maximum: 20): Scale for classifier-free guidance. Default: 7.5
  • string: Choose a scheduler. Default: "DPMSolverMultistep"
  • integer: Random seed. Leave blank to randomize the seed.
  • boolean: Disable the safety checker for generated images. This feature is only available through the API; the safety checker cannot be disabled when running on the website. See https://replicate.com/docs/how-does-replicate-work#safety. Default: false
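
To show how these inputs fit together, here is a minimal sketch of calling the model through Replicate's Python client (`pip install replicate`). The input keys and file names are assumptions to check against the model's API schema, and it assumes `REPLICATE_API_TOKEN` is set in your environment.

```python
# Minimal sketch: run the model via Replicate's Python client.
# Input keys and file names are assumptions; verify them against the
# model's API schema before relying on this.
import replicate

output = replicate.run(
    "stability-ai/stable-diffusion-inpainting",
    input={
        "prompt": "a vision of paradise. unreal engine",
        "image": open("input.png", "rb"),   # initial image, resized to height x width
        "mask": open("mask.png", "rb"),     # white pixels are inpainted, black are preserved
        "width": 512,                       # multiple of 64
        "height": 512,                      # multiple of 64
        "num_inference_steps": 50,
        "guidance_scale": 7.5,
    },
)
print(output)  # typically a list of URLs to the generated image(s)
```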

Output

The example output shown on this page was created by a different version of the model, stability-ai/stable-diffusion-inpainting:e5a34f91.

Run time and cost

This model costs approximately $0.0023 to run on Replicate, or 434 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.

This model runs on Nvidia A100 (80GB) GPU hardware. Predictions typically complete within 2 seconds.
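
Since the model is open source and packaged with Cog, one way to run it locally is to start the published container image and call its HTTP prediction endpoint. The sketch below is an assumption about a typical local setup (image name, port, and input keys should be verified against the Cog and Replicate docs), not an official recipe.

```python
# Sketch: call a locally running Cog container for this model.
# Assumes the container was started with something like
#   docker run -p 5000:5000 --gpus all r8.im/stability-ai/stable-diffusion-inpainting
# and that it exposes Cog's HTTP prediction API on port 5000.
import base64
import requests

def to_data_uri(path: str) -> str:
    # File inputs are sent to the local HTTP API as data URIs.
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()

resp = requests.post(
    "http://localhost:5000/predictions",
    json={"input": {
        "prompt": "a vision of paradise. unreal engine",
        "image": to_data_uri("input.png"),
        "mask": to_data_uri("mask.png"),
    }},
)
resp.raise_for_status()
print(resp.json()["output"])
```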

Readme

This stable-diffusion-2-inpainting model is resumed from stable-diffusion-2-base (512-base-ema.ckpt) and trained for another 200k steps. It follows the mask-generation strategy presented in LAMA; the mask, together with the latent VAE representation of the masked image, is used as additional conditioning (a shape-level sketch of this conditioning follows the model details below).

  • Developed by: Robin Rombach, Patrick Esser
  • Model type: Diffusion-based text-to-image generation model
  • Language(s): English
  • License: CreativeML Open RAIL++-M License
  • Model Description: This is a model that can be used to generate and modify images based on text prompts. It is a Latent Diffusion Model that uses a fixed, pretrained text encoder (OpenCLIP-ViT/H).
  • Resources for more information: GitHub Repository.
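
As a shape-level illustration of the conditioning described above: the noisy latent is concatenated with a downsampled mask and the VAE latent of the masked image before it enters the UNet. The channel counts and concatenation order below are assumptions based on the public Stable Diffusion inpainting implementations, not details taken from this page.

```python
# Illustrative sketch of assembling the inpainting UNet input.
# Channel counts / ordering are assumptions based on public SD-inpainting code.
import torch

B, H, W, f = 1, 512, 512, 8                        # batch, image size, VAE downsampling factor
noisy_latent  = torch.randn(B, 4, H // f, W // f)  # z_t, the noised image latent
masked_latent = torch.randn(B, 4, H // f, W // f)  # VAE encoding of the masked input image
mask_lowres   = torch.rand(B, 1, H // f, W // f)   # mask resized to latent resolution

unet_input = torch.cat([noisy_latent, mask_lowres, masked_latent], dim=1)
print(unet_input.shape)  # torch.Size([1, 9, 64, 64]); the inpainting UNet takes 9 input channels
```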

Intended use

See stability-ai/stable-diffusion for direct use, misuse, malicious use, out-of-scope use, limitations, and bias.

Training

Training Data

The model developers used the following dataset for training the model:

  • LAION-5B and subsets (details below). The training data is further filtered using LAION’s NSFW detector, with a “p_unsafe” score of 0.1 (conservative). For more details, please refer to LAION-5B’s NeurIPS 2022 paper and reviewer discussions on the topic.

Training Procedure

Stable Diffusion v2 is a latent diffusion model which combines an autoencoder with a diffusion model that is trained in the latent space of the autoencoder. During training,

  • Images are encoded through an encoder, which turns images into latent representations. The autoencoder uses a relative downsampling factor of 8 and maps images of shape H x W x 3 to latents of shape H/f x W/f x 4
  • Text prompts are encoded through the OpenCLIP-ViT/H text-encoder.
  • The output of the text encoder is fed into the UNet backbone of the latent diffusion model via cross-attention.
  • The loss is a reconstruction objective between the noise that was added to the latent and the prediction made by the UNet. We also use the so-called v-objective, see https://arxiv.org/abs/2202.00512 (a minimal sketch of the v-target follows this list).
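
The sketch below spells out the v-target for a noised latent; the schedule values and variable names are illustrative, not taken from the actual training configuration.

```python
# Sketch of the v-objective training target (Salimans & Ho,
# https://arxiv.org/abs/2202.00512): for a noised latent
# z_t = alpha_t * x0 + sigma_t * eps, the UNet is trained with an MSE loss
# against v = alpha_t * eps - sigma_t * x0. Values below are illustrative.
import torch

f = 8                                          # VAE downsampling factor
x0  = torch.randn(1, 4, 512 // f, 512 // f)    # clean latent (4 x H/f x W/f)
eps = torch.randn_like(x0)                     # Gaussian noise added to the latent
alpha_t, sigma_t = 0.9, (1 - 0.9 ** 2) ** 0.5  # example values from a noise schedule

z_t = alpha_t * x0 + sigma_t * eps             # noised latent fed to the UNet
v   = alpha_t * eps - sigma_t * x0             # target the UNet learns to predict
# the training loss would be mse_loss(unet(z_t, t, text_embeddings), v)
```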