jagilley / controlnet

Modify images with a prompt while preserving their structure

  • Public
  • 55.2K runs
  • GitHub
  • License

Input

Output

Run time and cost

This model runs on Nvidia A100 (40GB) GPU hardware. Predictions typically complete within 4 minutes. The predict time for this model varies significantly based on the inputs.

Readme

Model by Lyumin Zhang

Usage

Input an image, and prompt the model to generate an image as you would for Stable Diffusion.

Detail detection methods

Use one of eight different methods for detecting the details in the original image: - Canny edge detection: automatically detect edges in the image using adjustable thresholds - Depth detection: automatically detect the depths within the image, then diffuse based on the detected depths - HED: detect edges in the image more softly than with the ‘canny’ method - Normal maps: automatically detect the geometry of the input image, then diffuse based on the original geometry - Scribble: use a user-drawn scribble image as a basis for the final image - Seg: apply semantic segmentation to the input image, then diffuse with respect to the resulting partition - Openpose: detect the pose of any humans in the image, then generate an image with a human in the same pose

Model description

ControlNet is a neural network structure which allows control of pretrained large diffusion models to support additional input conditions beyond prompts. The ControlNet learns task-specific conditions in an end-to-end way, and the learning is robust even when the training dataset is small (< 50k samples). Moreover, training a ControlNet is as fast as fine-tuning a diffusion model, and the model can be trained on a personal device. Alternatively, if powerful computation clusters are available, the model can scale to large amounts of training data (millions to billions of rows). Large diffusion models like Stable Diffusion can be augmented with ControlNets to enable conditional inputs like edge maps, segmentation maps, keypoints, etc.

Original model & code on GitHub

Other ControlNet Models

This is a general ControlNet model which allows you to select any of the eight detail detection methods. However, you can also use a model which is specific to one of the particular methods. For applications where you expect to call the model a large number of times with an API, these may perform better.

ControlNet for generating images from drawings Scribble: https://replicate.com/jagilley/controlnet-scribble

ControlNets for generating humans based on input image Human Pose Detection: https://replicate.com/jagilley/controlnet-pose

ControlNets for preserving general qualities about an input image Edge detection: https://replicate.com/jagilley/controlnet-canny HED maps: https://replicate.com/jagilley/controlnet-hed Depth map: https://replicate.com/jagilley/controlnet-depth2img Hough line detection: https://replicate.com/jagilley/controlnet-hough Normal map: https://replicate.com/jagilley/controlnet-normal

Citation

@misc{https://doi.org/10.48550/arxiv.2302.05543,
  doi = {10.48550/ARXIV.2302.05543},
  url = {https://arxiv.org/abs/2302.05543},
  author = {Zhang, Lvmin and Agrawala, Maneesh},
  keywords = {Computer Vision and Pattern Recognition (cs.CV), Artificial Intelligence (cs.AI), Graphics (cs.GR), Human-Computer Interaction (cs.HC), Multimedia (cs.MM), FOS: Computer and information sciences, FOS: Computer and information sciences},
  title = {Adding Conditional Control to Text-to-Image Diffusion Models},
  publisher = {arXiv},
  year = {2023},
  copyright = {arXiv.org perpetual, non-exclusive license}
}