rossjillian / controlnet_2-1

ControlNet with SD 2.1

Run time and cost

This model runs on NVIDIA A100 (40GB) GPU hardware. Predictions typically complete within 80 seconds, but prediction time varies significantly with the inputs.

Readme

Model by Lvmin Zhang. This is the version trained on Stable Diffusion 2.1 by Thibaud Zamora.

Usage

Provide an input image and a text prompt, written as you would for Stable Diffusion, then specify the type of structure you want to condition on.
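As a rough illustration of the three inputs described above, the following Python sketch bundles an image, a prompt, and a structure type and validates them before any request is made. The class name, the field names (`image`, `prompt`, `structure`), and the spellings of the structure types are all assumptions made for this example; consult the model's input schema for the actual parameter names and allowed values.

```python
from dataclasses import dataclass

# Structure types taken from the model description. The exact spellings are
# illustrative assumptions, not the model's confirmed option values.
STRUCTURES = ("canny", "depth", "hed", "hough", "normal", "pose", "scribble", "seg")


@dataclass(frozen=True)
class ControlNetRequest:
    """Hypothetical container for the three inputs the model needs."""

    image_path: str
    prompt: str
    structure: str

    def __post_init__(self):
        # Fail early on inputs that could never produce a useful prediction.
        if not self.prompt.strip():
            raise ValueError("prompt must not be empty")
        if self.structure not in STRUCTURES:
            raise ValueError(
                f"unknown structure {self.structure!r}; expected one of {STRUCTURES}"
            )

    def to_input(self) -> dict:
        # Hypothetical field names; the real schema may differ.
        return {
            "image": self.image_path,
            "prompt": self.prompt,
            "structure": self.structure,
        }


request = ControlNetRequest("room.png", "a cozy living room at dusk", "depth")
payload = request.to_input()
```

Validating the structure type locally keeps a typo from costing a full GPU prediction.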

Model description

This model uses ControlNet to adapt Stable Diffusion so that generated images share the structure of an input image of your choosing. The structure can be extracted using:

  • Canny edge detection. The model is trained on data from a canny edge detector with random thresholds.

  • Depth maps. The model is trained on data from MiDaS.

  • HED edge detection.

  • Hough/MLSD line detection. The model is trained on data from a learning-based deep Hough transform that detects straight lines.

  • Normal maps. The model is trained on data with accurate, dense, far-range depth measurements.

  • Pose detection. The model is trained on data that uses a learning-based pose estimation method to “find” humans in images from the internet.

  • Scribble. The model is trained on human-like scribbles synthesized from images using a combination of HED boundary detection and a set of strong data augmentations.

  • Semantic segmentation. The model is trained on data from a segmentation model that divides the input image into semantic regions; those regions are then used as the conditioning input when generating a new image.
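To make the first item in the list concrete, here is a minimal sketch of how an edge map with randomly drawn thresholds could be produced as a conditioning image. It is not the author's preprocessing code: the detector is a simplified gradient-magnitude stand-in rather than full Canny (in practice a library routine such as OpenCV's `cv2.Canny` would normally be used), and the threshold range of 50 to 200 is an assumption chosen for illustration.

```python
import numpy as np


def random_thresholds(rng, lo=50.0, hi=200.0):
    """Draw an ordered (low, high) threshold pair. The range is an assumption."""
    a, b = rng.uniform(lo, hi, size=2)
    return min(a, b), max(a, b)


def edge_map(gray, low, high):
    """Simplified stand-in for Canny: gradient magnitude plus a double threshold.

    Pixels at or above `high` are edges; pixels at or above `low` are kept only
    if they touch such a pixel. Non-maximum suppression and full hysteresis
    tracking are omitted.
    """
    gy, gx = np.gradient(gray.astype(np.float64))
    magnitude = np.hypot(gx, gy)
    strong = magnitude >= high
    weak = magnitude >= low

    # Mark every pixel whose 3x3 neighbourhood contains a strong pixel.
    h, w = strong.shape
    padded = np.pad(strong, 1)
    near_strong = np.zeros_like(strong)
    for dy in range(3):
        for dx in range(3):
            near_strong |= padded[dy:dy + h, dx:dx + w]

    return ((weak & near_strong) * 255).astype(np.uint8)


# Synthetic test image: dark left half, bright right half.
image = np.zeros((8, 8), dtype=np.uint8)
image[:, 4:] = 255

rng = np.random.default_rng(0)
low, high = random_thresholds(rng)
edges_random = edge_map(image, low, high)      # varies with the drawn thresholds
edges_fixed = edge_map(image, 50.0, 100.0)     # fixed thresholds for comparison
```

Drawing new thresholds per training image exposes the model to both sparse and dense edge maps, which is the point of randomizing them.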

ControlNet

ControlNet is a neural network structure which allows control of pretrained large diffusion models to support additional input conditions beyond prompts. The ControlNet learns task-specific conditions in an end-to-end way, and the learning is robust even when the training dataset is small (< 50k samples). Moreover, training a ControlNet is as fast as fine-tuning a diffusion model, and the model can be trained on a personal device. Alternatively, if powerful computation clusters are available, the model can scale to large amounts of training data (millions to billions of rows). Large diffusion models like Stable Diffusion can be augmented with ControlNets to enable conditional inputs like edge maps, segmentation maps, keypoints, etc.
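The sketch below illustrates one way such a structure can attach to a pretrained model, following the zero-initialized connection design described in the paper cited below. It is a toy NumPy version under loud assumptions: a single linear layer with `tanh` stands in for a diffusion model block, and plain matrices stand in for the paper's zero-initialized convolutions. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 4


def block(x, weights):
    """Toy stand-in for one network block: a linear layer followed by tanh."""
    return np.tanh(x @ weights)


w_frozen = rng.normal(size=(DIM, DIM))  # pretrained weights, never updated
w_copy = w_frozen.copy()                # trainable copy that sees the condition


def controlled_block(x, cond, proj_in, proj_out):
    """Frozen block output plus a conditioned branch gated by two projections."""
    base = block(x, w_frozen)
    control = block(x + cond @ proj_in, w_copy) @ proj_out
    return base + control


x = rng.normal(size=(2, DIM))
cond = rng.normal(size=(2, DIM))

# Before training, both projections are all zeros, so the control branch
# contributes nothing and the pretrained behaviour is preserved exactly.
zeros = np.zeros((DIM, DIM))
untrained = controlled_block(x, cond, zeros, zeros)

# Once the projections move away from zero, the condition shifts the output.
# The 0.1 scale is an arbitrary value standing in for learned weights.
trained = controlled_block(x, cond, 0.1 * np.eye(DIM), 0.1 * np.eye(DIM))
```

Starting from an exact copy of the pretrained behaviour is one plausible reason training stays stable on small datasets, since the added branch cannot inject noise at initialization.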

Original model & code on GitHub

Citation

@misc{https://doi.org/10.48550/arxiv.2302.05543,
  doi = {10.48550/ARXIV.2302.05543},
  url = {https://arxiv.org/abs/2302.05543},
  author = {Zhang, Lvmin and Agrawala, Maneesh},
  keywords = {Computer Vision and Pattern Recognition (cs.CV), Artificial Intelligence (cs.AI), Graphics (cs.GR), Human-Computer Interaction (cs.HC), Multimedia (cs.MM), FOS: Computer and information sciences},
  title = {Adding Conditional Control to Text-to-Image Diffusion Models},
  publisher = {arXiv},
  year = {2023},
  copyright = {arXiv.org perpetual, non-exclusive license}
}