rossjillian / controlnet

Control diffusion models

  • Public
  • 7.5M runs
  • A100 (80GB)
  • GitHub
  • Paper
  • License

Input

image
file (required)
Input image

string (required)
Prompt for the model

string (required)
ControlNet structure to condition on

integer
(minimum: 1, maximum: 4)
Number of images to output (higher values may run out of memory)
Default: 1

integer
Resolution of the output image (the image is scaled so that this is its smaller dimension)
Default: 512

string
Choose a scheduler.
Default: "DDIM"

integer
Number of denoising steps
Default: 20

number
(minimum: 0.1, maximum: 30)
Scale for classifier-free guidance (see the sketch after this parameter list)
Default: 9

integer
Random seed

number
Controls the amount of noise added to the input data during the denoising diffusion process; a higher value means more noise
Default: 0

string
Negative prompt
Default: "Longbody, lowres, bad anatomy, bad hands, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality"

boolean
Whether to return the reference image along with the output
Default: false

integer
(minimum: 1, maximum: 255)
[canny only] Line detection low threshold
Default: 100

integer
(minimum: 1, maximum: 255)
[canny only] Line detection high threshold
Default: 200
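The guidance scale above follows the standard classifier-free guidance recipe used by Stable Diffusion samplers. The sketch below is an illustration of how a sampler typically applies that scale, not this model's actual code:

```python
import torch

def apply_cfg(eps_uncond: torch.Tensor, eps_cond: torch.Tensor, scale: float = 9.0) -> torch.Tensor:
    # scale = 1 reduces to the plain conditional prediction; larger values
    # follow the prompt more closely at the cost of diversity.
    return eps_uncond + scale * (eps_cond - eps_uncond)

# Toy usage with random tensors standing in for the U-Net's two noise predictions.
eps_u = torch.randn(1, 4, 64, 64)
eps_c = torch.randn(1, 4, 64, 64)
guided = apply_cfg(eps_u, eps_c, scale=9.0)
```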


Run time and cost

This model costs approximately $0.0089 to run on Replicate, or 112 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.

This model runs on Nvidia A100 (80GB) GPU hardware. Predictions typically complete within 7 seconds. The predict time for this model varies significantly based on the inputs.

Readme

Model by Lvmin Zhang

Usage

Provide an input image and a prompt, as you would for Stable Diffusion, and specify the type of structure you want to condition on.
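A minimal sketch of a call using the Replicate Python client follows. Every input key except image is an assumed name inferred from the parameter descriptions above, not a confirmed schema for this version; check the model's API schema for the exact names.

```python
# Minimal sketch using the Replicate Python client (pip install replicate).
# Requires REPLICATE_API_TOKEN in the environment.
# NOTE: all input keys except "image" are assumptions inferred from the
# parameter descriptions above.
import replicate

output = replicate.run(
    "rossjillian/controlnet",
    input={
        "image": open("room.png", "rb"),   # reference image to condition on
        "prompt": "a bright, plant-filled living room",
        "structure": "canny",              # assumed name for the conditioning type
        "num_samples": 1,                  # assumed name for "number of images"
        "image_resolution": 512,           # assumed name
        "steps": 20,
        "scale": 9,                        # classifier-free guidance scale
    },
)
print(output)  # typically a list of URLs to the generated image(s)
```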

Model description

This model is ControlNet, which adapts Stable Diffusion to generate images that share the structure of an input image of your choosing, using one of the following conditioning types:

  • Canny edge detection. The model is trained on data from a canny edge detector with random thresholds (a preprocessing sketch follows this list).

  • Depth maps. The model is trained on data from MiDaS.

  • HED edge detection.

  • Hough/MLSD line detection. The model is trained on data from a learning-based deep Hough transform that detects straight lines.

  • Normal maps. The model is trained on data with accurate, dense, far-range depth measurements.

  • Pose detection. The model is trained on data that uses a learning-based pose estimation method to find humans in internet images.

  • Scribble. The model is trained on data synthesized from human scribbles from images using a combination of HED boundary detection and a set of strong data augmentations.

  • Semantic segmentation. The model is trained on data from a segmentation model that splits the input image into different semantic regions; those regions are then used as conditioning input when generating a new image.
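For the canny option, the low/high threshold inputs correspond to the hysteresis thresholds of a standard Canny detector. A rough sketch of the kind of edge map used as conditioning, using OpenCV (file names are placeholders):

```python
# Rough sketch of producing a canny conditioning image with OpenCV; the
# 100/200 defaults above are the low/high hysteresis thresholds.
import cv2

gray = cv2.imread("room.png", cv2.IMREAD_GRAYSCALE)
edges = cv2.Canny(gray, 100, 200)  # (image, low threshold, high threshold)
cv2.imwrite("room_canny.png", edges)
```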

ControlNet

ControlNet is a neural network structure that allows pretrained large diffusion models to be controlled with additional input conditions beyond text prompts. The ControlNet learns task-specific conditions in an end-to-end way, and the learning is robust even when the training dataset is small (< 50k samples). Moreover, training a ControlNet is as fast as fine-tuning a diffusion model, and the model can be trained on a personal device. Alternatively, if powerful computation clusters are available, the model can scale to large amounts of training data (millions to billions of samples). Large diffusion models like Stable Diffusion can be augmented with ControlNets to enable conditional inputs such as edge maps, segmentation maps, keypoints, etc.
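A highly simplified PyTorch sketch of this mechanism, assuming a single block rather than the full U-Net: the pretrained block is frozen, a trainable copy receives the extra condition, and a zero-initialized convolution injects its output, so training starts from the unmodified model. This is an illustration of the idea, not the actual ControlNet implementation.

```python
# Highly simplified sketch of the ControlNet mechanism (not the real code).
import copy
import torch
import torch.nn as nn

class ControlledBlock(nn.Module):
    def __init__(self, pretrained_block: nn.Module, channels: int):
        super().__init__()
        self.locked = pretrained_block
        for p in self.locked.parameters():
            p.requires_grad = False                        # frozen pretrained weights
        self.trainable = copy.deepcopy(pretrained_block)   # trainable copy
        self.zero_conv = nn.Conv2d(channels, channels, kernel_size=1)
        nn.init.zeros_(self.zero_conv.weight)              # zero conv: no effect at init
        nn.init.zeros_(self.zero_conv.bias)

    def forward(self, x: torch.Tensor, condition: torch.Tensor) -> torch.Tensor:
        # Frozen output plus an (initially zero) learned correction computed
        # from the input and the conditioning signal.
        return self.locked(x) + self.zero_conv(self.trainable(x + condition))

# Toy usage: condition a single conv block on an edge-map-like tensor.
block = ControlledBlock(nn.Conv2d(4, 4, kernel_size=3, padding=1), channels=4)
x = torch.randn(1, 4, 64, 64)
cond = torch.randn(1, 4, 64, 64)
y = block(x, cond)  # equals block.locked(x) at initialization
```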

Original model & code on GitHub

Citation

@misc{https://doi.org/10.48550/arxiv.2302.05543,
  doi = {10.48550/ARXIV.2302.05543},
  url = {https://arxiv.org/abs/2302.05543},
  author = {Zhang, Lvmin and Agrawala, Maneesh},
  keywords = {Computer Vision and Pattern Recognition (cs.CV), Artificial Intelligence (cs.AI), Graphics (cs.GR), Human-Computer Interaction (cs.HC), Multimedia (cs.MM), FOS: Computer and information sciences},
  title = {Adding Conditional Control to Text-to-Image Diffusion Models},
  publisher = {arXiv},
  year = {2023},
  copyright = {arXiv.org perpetual, non-exclusive license}
}