Create variations of an image while preserving shape and depth.

Run time and cost

Predictions run on Nvidia T4 GPU hardware. Predictions typically complete within 21 seconds. The predict time for this model varies significantly based on the inputs.

This stable-diffusion-depth2img model is resumed from stable-diffusion-2-base (512-base-ema.ckpt) and finetuned for 200k steps. It adds an extra input channel to process the (relative) depth prediction produced by MiDaS (dpt_hybrid), which is used as additional conditioning.
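
To make the extra input channel concrete, here is a minimal pure-Python sketch of how a depth map could be attached to a 4-channel latent. It is an illustration under assumptions, not the model's actual code: the [-1, 1] normalization, the strided downsampling (a stand-in for proper interpolation), and the helper names `normalize_depth`, `downsample`, and `add_depth_channel` are all hypothetical.

```python
def normalize_depth(depth):
    """Rescale a 2D relative depth map to [-1, 1] (assumed normalization)."""
    flat = [d for row in depth for d in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # A constant map carries no relative depth information.
        return [[0.0 for _ in row] for row in depth]
    return [[2.0 * (d - lo) / (hi - lo) - 1.0 for d in row] for row in depth]


def downsample(depth, f=8):
    """Strided subsampling to latent resolution (stand-in for interpolation)."""
    return [row[::f] for row in depth[::f]]


def add_depth_channel(latent, depth_small):
    """Append depth as one extra channel: H x W x 4 -> H x W x 5."""
    if len(latent) != len(depth_small) or len(latent[0]) != len(depth_small[0]):
        raise ValueError("latent and depth map must share spatial size")
    return [
        [list(pixel) + [d] for pixel, d in zip(lat_row, depth_row)]
        for lat_row, depth_row in zip(latent, depth_small)
    ]


# Toy example: a 16 x 16 "image" with downsampling factor 8.
H, W, f = 16, 16, 8
depth = [[float(y + x) for x in range(W)] for y in range(H)]          # fake depth ramp
latent = [[[0.0] * 4 for _ in range(W // f)] for _ in range(H // f)]  # fake latent
unet_input = add_depth_channel(latent, downsample(normalize_depth(depth), f))
print(len(unet_input), len(unet_input[0]), len(unet_input[0][0]))  # 2 2 5
```

The point of the sketch is only the channel count: the denoiser sees the usual four latent channels plus one depth channel at the same spatial resolution.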

Model description

Intended use

See stabilityai/stable-diffusion-2-depth for direct use, misuse, malicious use, out-of-scope use, limitations, and bias.


Training Data
The model developers used the following dataset for training the model:

  • LAION-5B and subsets (details below). The training data is further filtered using LAION's NSFW detector, with a "p_unsafe" score of 0.1 (conservative). For more details, please refer to LAION-5B's NeurIPS 2022 paper and reviewer discussions on the topic.
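
As a rough illustration of this kind of filtering, the sketch below keeps only samples whose detector score falls below the 0.1 threshold. The record layout, the `filter_safe` helper, and the strict `<` comparison are assumptions made for the example; the text above only states the threshold value.

```python
P_UNSAFE_THRESHOLD = 0.1  # value from the text; comparison direction is assumed


def filter_safe(samples, threshold=P_UNSAFE_THRESHOLD):
    """Keep samples whose (hypothetical) p_unsafe score is below the threshold."""
    return [s for s in samples if s["p_unsafe"] < threshold]


# Hypothetical records, for illustration only.
samples = [
    {"url": "a.jpg", "p_unsafe": 0.02},
    {"url": "b.jpg", "p_unsafe": 0.35},
    {"url": "c.jpg", "p_unsafe": 0.10},
]
kept = filter_safe(samples)
print([s["url"] for s in kept])  # ['a.jpg']
```

A low threshold such as 0.1 is conservative in the sense that it discards anything the detector is even mildly unsure about.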

Training Procedure
Stable Diffusion v2 is a latent diffusion model that combines an autoencoder with a diffusion model trained in the latent space of the autoencoder. During training:

  • Images are encoded through an encoder, which turns images into latent representations. The autoencoder uses a relative downsampling factor of f = 8 and maps images of shape H x W x 3 to latents of shape H/f x W/f x 4.
  • Text prompts are encoded through the OpenCLIP-ViT/H text-encoder.
  • The output of the text encoder is fed into the UNet backbone of the latent diffusion model via cross-attention.
  • The loss is a reconstruction objective between the noise that was added to the latent and the prediction made by the UNet. We also use the so-called v-objective, see https://arxiv.org/abs/2202.00512.
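
Two of the steps above can be sketched in a few lines of plain Python: the latent shape arithmetic, and the v-objective target as defined in the cited paper (v = alpha * eps - sigma * x0 for a noised latent z_t = alpha * x0 + sigma * eps). This is a sketch under assumptions rather than the training code: it presumes a variance-preserving schedule (alpha² + sigma² = 1), works on flat lists instead of tensors, and all function names are hypothetical.

```python
import math


def latent_shape(height, width, f=8, latent_channels=4):
    """Latent shape for an H x W x 3 image with downsampling factor f."""
    if height % f or width % f:
        raise ValueError("height and width must be divisible by f")
    return (height // f, width // f, latent_channels)


def noise_and_v_target(x0, eps, alpha, sigma):
    """Return the noised latent z_t and the v target, elementwise.

    Assumes alpha**2 + sigma**2 == 1 (variance-preserving schedule).
    """
    z_t = [alpha * x + sigma * e for x, e in zip(x0, eps)]
    v = [alpha * e - sigma * x for x, e in zip(x0, eps)]
    return z_t, v


def recover_from_v(z_t, v, alpha, sigma):
    """Invert the parameterization: recover x0 and eps from z_t and v."""
    x0 = [alpha * z - sigma * w for z, w in zip(z_t, v)]
    eps = [sigma * z + alpha * w for z, w in zip(z_t, v)]
    return x0, eps


def mse(pred, target):
    """Mean squared error between a prediction and its target."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


print(latent_shape(512, 512))  # (64, 64, 4)

# Toy latent and noise values, chosen only for illustration.
x0, eps = [0.5, -1.0], [0.1, 0.3]
alpha, sigma = math.cos(0.3), math.sin(0.3)  # satisfies alpha^2 + sigma^2 = 1
z_t, v = noise_and_v_target(x0, eps, alpha, sigma)
x0_rec, eps_rec = recover_from_v(z_t, v, alpha, sigma)
print(mse(v, v))  # 0.0 for a perfect prediction
```

Because x0 and eps can both be recovered from (z_t, v), a network trained to predict v carries the same information as one trained to predict the added noise.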