pwntus / stable-diffusion-depth2img

Create variations of an image while preserving shape and depth.

  • Public
  • 7.9K runs
  • A100 (80GB)
  • GitHub
  • Paper
  • License

Input

prompt
string
The prompt to guide the image generation.
Default: "A fantasy landscape, trending on artstation"

negative_prompt
string
The prompt NOT to guide the image generation. Ignored when not using guidance.

image
file (required)
Image that will be used as the starting point for the process.

prompt_strength
number
Prompt strength when providing the image. 1.0 corresponds to full destruction of information in the init image.
Default: 0.8

num_outputs
integer
(minimum: 1, maximum: 8)
Number of images to output. Higher numbers of outputs may run out of memory (OOM).
Default: 1

num_inference_steps
integer
(minimum: 1, maximum: 500)
The number of denoising steps. More denoising steps usually lead to a higher-quality image at the expense of slower inference.
Default: 50

guidance_scale
number
(minimum: 1, maximum: 20)
Scale for classifier-free guidance. A higher guidance scale encourages images that are closely linked to the text prompt, usually at the expense of lower image quality.
Default: 7.5

scheduler
string
Choose a scheduler.
Default: "DPMSolverMultistep"

seed
integer
Random seed. Leave blank to randomize the seed.
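
For illustration, a prediction with these inputs can be created from Python using the official Replicate client. This is a minimal sketch, assuming `pip install replicate`, a REPLICATE_API_TOKEN set in the environment, and that the parameter names match the form above; append a specific version (owner/name:version) if you need reproducible behavior.

```python
# Minimal sketch: calling the model through the Replicate Python client.
# Assumes REPLICATE_API_TOKEN is set and the parameter names above are accurate.
import replicate

output = replicate.run(
    "pwntus/stable-diffusion-depth2img",  # optionally append ":<version>" to pin a version
    input={
        "prompt": "A fantasy landscape, trending on artstation",
        "image": open("input.png", "rb"),  # starting-point image
        "prompt_strength": 0.8,
        "num_outputs": 1,
        "num_inference_steps": 50,
        "guidance_scale": 7.5,
        "scheduler": "DPMSolverMultistep",
    },
)

# The model returns one URL (or file-like output) per generated image.
for item in output:
    print(item)
```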

Output

Example output image (created using a different version of the model, pwntus/stable-diffusion-depth2img:90a616d5).

Run time and cost

This model costs approximately $0.0095 to run on Replicate, or about 105 runs per $1, but this varies depending on your inputs. It is also open source, and you can run it on your own computer with Docker.

This model runs on Nvidia A100 (80GB) GPU hardware. Predictions typically complete within 7 seconds. The predict time for this model varies significantly based on the inputs.

Readme

This stable-diffusion-depth2img model is resumed from stable-diffusion-2-base (512-base-ema.ckpt) and finetuned for 200k steps. It adds an extra input channel to process the (relative) depth prediction produced by MiDaS (dpt_hybrid), which is used as additional conditioning.
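
An equivalent depth-conditioned pipeline is available in Hugging Face diffusers via the upstream stabilityai/stable-diffusion-2-depth weights. The snippet below is a minimal sketch of that route, not this repository's Cog code; file paths and prompts are placeholders, and the hosted model's defaults or schedulers may differ.

```python
# Minimal sketch: depth-conditioned img2img with Hugging Face diffusers,
# using the upstream stabilityai/stable-diffusion-2-depth weights.
# Not this repository's Cog code; paths and prompts are placeholders.
import torch
from diffusers import StableDiffusionDepth2ImgPipeline
from PIL import Image

pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-depth",
    torch_dtype=torch.float16,
).to("cuda")

init_image = Image.open("input.png").convert("RGB")

result = pipe(
    prompt="A fantasy landscape, trending on artstation",
    negative_prompt="blurry, low quality",
    image=init_image,      # MiDaS depth is estimated from this image internally
    strength=0.8,          # analogous to prompt_strength above
    num_inference_steps=50,
    guidance_scale=7.5,
).images[0]

result.save("output.png")
```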

Model description

Intended use

See stabilityai/stable-diffusion-2-depth for direct use, misuse, malicious use, out-of-scope use, limitations, and bias.

Training

Training Data

The model developers used the following dataset for training the model:

  • LAION-5B and subsets (details below). The training data is further filtered using LAION’s NSFW detector, with a “p_unsafe” score of 0.1 (conservative). For more details, please refer to LAION-5B’s NeurIPS 2022 paper and reviewer discussions on the topic.

Training Procedure

Stable Diffusion v2 is a latent diffusion model which combines an autoencoder with a diffusion model that is trained in the latent space of the autoencoder. During training:

  • Images are encoded through an encoder, which turns images into latent representations. The autoencoder uses a relative downsampling factor of 8 and maps images of shape H x W x 3 to latents of shape H/f x W/f x 4, where f = 8 (for example, a 512 x 512 x 3 image becomes a 64 x 64 x 4 latent).
  • Text prompts are encoded through the OpenCLIP-ViT/H text-encoder.
  • The output of the text encoder is fed into the UNet backbone of the latent diffusion model via cross-attention.
  • The loss is a reconstruction objective between the noise that was added to the latent and the prediction made by the UNet. The model also uses the so-called v-objective (see https://arxiv.org/abs/2202.00512).
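
For reference, the v-objective from the linked paper (Salimans & Ho, 2022) trains the UNet to predict a "velocity" target rather than the added noise. In that paper's notation, with a noised latent z_t, the objective can be sketched as:

```latex
% v-objective, following https://arxiv.org/abs/2202.00512:
% the network predicts the velocity v rather than the noise epsilon.
\[
  z_t = \alpha_t x + \sigma_t \epsilon, \qquad
  v_t \equiv \alpha_t \epsilon - \sigma_t x,
\]
\[
  \mathcal{L} = \mathbb{E}_{x,\, \epsilon \sim \mathcal{N}(0, I),\, t}
  \left[ \lVert \hat{v}_\theta(z_t, t, c) - v_t \rVert_2^2 \right]
\]
% where c denotes the conditioning (here, the text embedding fed to the UNet
% via cross-attention, alongside the depth channel described above).
```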