This `stable-diffusion-depth2img` model is resumed from stable-diffusion-2-base (`512-base-ema.ckpt`) and finetuned for 200k steps. An extra input channel was added to process the (relative) depth prediction produced by MiDaS (`dpt_hybrid`), which is used as additional conditioning.
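As a rough illustration of the extra input channel, the sketch below computes the shape of the tensor the UNet would receive if the depth map were resized to the latent resolution and concatenated with the 4-channel latent along the channel axis. The concatenation scheme, the channel-first layout, and the helper name `unet_input_shape` are assumptions made for illustration; this card only states that one input channel was added.

```python
def unet_input_shape(height, width, latent_channels=4, depth_channels=1, factor=8):
    """Return (channels, h, w) of a hypothetical UNet input.

    Assumes the depth map is resized to the latent resolution and stacked
    on top of the latent channels (an illustrative assumption).
    """
    if height % factor or width % factor:
        raise ValueError("height and width must be divisible by the downsampling factor")
    return (latent_channels + depth_channels, height // factor, width // factor)


# A 512 x 512 image with one depth channel added to the 4 latent channels.
print(unet_input_shape(512, 512))  # (5, 64, 64)
```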

## Model description

- **Developed by:** Robin Rombach, Patrick Esser
- **Model type:** Diffusion-based text-to-image generation model
- **Language(s):** English
- **License:** CreativeML Open RAIL++-M License
- **Model Description:** This is a model that can be used to generate and modify images based on text prompts. It is a Latent Diffusion Model that uses a fixed, pretrained text encoder (OpenCLIP-ViT/H).
- **Resources for more information:** GitHub Repository.

## Intended use

See stabilityai/stable-diffusion-2-depth for direct use, misuse, malicious use, out-of-scope use, limitations, and bias.

## Training

**Training Data**
The model developers used the following dataset for training the model:

- LAION-5B and subsets (details below). The training data is further filtered using LAION’s NSFW detector, with a “p_unsafe” score of 0.1 (conservative). For more details, please refer to LAION-5B’s NeurIPS 2022 paper and reviewer discussions on the topic.
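The filtering rule can be pictured with a small sketch. It assumes each sample carries a detector score under a hypothetical `p_unsafe` key and that samples scoring below the threshold are kept; the key name, the record layout, and the strictness of the comparison are illustrative assumptions, not details of the actual pipeline.

```python
def keep_sample(sample, threshold=0.1):
    """Keep a sample only if its (hypothetical) p_unsafe score is below the threshold."""
    return sample["p_unsafe"] < threshold


# Made-up records, purely for illustration.
samples = [
    {"url": "a.jpg", "p_unsafe": 0.02},
    {"url": "b.jpg", "p_unsafe": 0.35},
]
filtered = [s for s in samples if keep_sample(s)]
print([s["url"] for s in filtered])  # ['a.jpg']
```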

**Training Procedure**
Stable Diffusion v2 is a latent diffusion model which combines an autoencoder with a diffusion model that is trained in the latent space of the autoencoder. During training,

- Images are encoded through an encoder, which turns images into latent representations. The autoencoder uses a relative downsampling factor of f = 8 and maps images of shape H x W x 3 to latents of shape H/f x W/f x 4.
- Text prompts are encoded through the OpenCLIP-ViT/H text-encoder.
- The output of the text encoder is fed into the UNet backbone of the latent diffusion model via cross-attention.
- The loss is a reconstruction objective between the noise that was added to the latent and the prediction made by the UNet. We also use the so-called *v-objective*, see https://arxiv.org/abs/2202.00512.
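The shape arithmetic and the v-objective target from the steps above can be sketched in plain Python. The v definition used here, v = alpha * eps - sigma * x, follows the linked paper and assumes a variance-preserving schedule with alpha^2 + sigma^2 = 1. The helper names and the flat-list representation of a latent are illustrative assumptions, not the training code.

```python
import math
import random


def latent_shape(height, width, f=8, channels=4):
    """Map an H x W x 3 image shape to the H/f x W/f x 4 latent shape."""
    return (height // f, width // f, channels)


def noised_latent_and_v_target(x, eps, alpha, sigma):
    """Return the noised latent z = alpha*x + sigma*eps and the target v = alpha*eps - sigma*x."""
    z = [alpha * xi + sigma * ei for xi, ei in zip(x, eps)]
    v = [alpha * ei - sigma * xi for xi, ei in zip(x, eps)]
    return z, v


def mse(pred, target):
    """Mean squared error between a prediction and its regression target."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


print(latent_shape(512, 512))  # (64, 64, 4)

# Toy flattened latent and Gaussian noise at an arbitrary point of the schedule.
random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(8)]
eps = [random.gauss(0.0, 1.0) for _ in range(8)]
phi = 0.3  # angle parameterising the schedule: alpha = cos(phi), sigma = sin(phi)
alpha, sigma = math.cos(phi), math.sin(phi)

z, v = noised_latent_and_v_target(x, eps, alpha, sigma)

# With alpha^2 + sigma^2 = 1, the clean latent is recoverable as alpha*z - sigma*v.
x_rec = [alpha * zi - sigma * vi for zi, vi in zip(z, v)]
```

A network trained with this objective would regress `v` from `z` (plus the text and depth conditioning), with `mse(prediction, v)` as the per-sample loss in this sketch.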