jagilley / controlnet-scribble

Generate detailed images from scribbled drawings

  • Public
  • 38.2M runs
  • A100 (80GB)
  • GitHub
  • License

Input

  • Input image (image): file, required
  • Prompt for the model: string, required
  • Number of samples (higher values may OOM): string, default "1"
  • Image resolution to be generated: string, default "512"
  • Steps: integer, default 20
  • Guidance Scale: number, minimum 0.1, maximum 30, default 9
  • Seed: integer
  • eta (DDIM): number, default 0
  • Added Prompt: string, default "best quality, extremely detailed"
  • Negative Prompt: string, default "longbody, lowres, bad anatomy, bad hands, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality"

Output


Run time and cost

This model costs approximately $0.0083 to run on Replicate, or 120 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.
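The per-run figure and the runs-per-dollar figure quoted above are related by simple division. A minimal sketch of that arithmetic follows; the function names are illustrative, and the price is the approximate value stated on this page, so real costs will vary with your inputs.

```python
import math

# Approximate per-run price quoted on this page (USD); actual cost varies with inputs.
COST_PER_RUN_USD = 0.0083


def runs_per_dollar(cost_per_run: float = COST_PER_RUN_USD) -> int:
    """Whole number of runs that one dollar covers at the given per-run price."""
    return math.floor(1 / cost_per_run)


def estimated_cost(runs: int, cost_per_run: float = COST_PER_RUN_USD) -> float:
    """Rough total cost in USD for a batch of runs, rounded to cents."""
    return round(runs * cost_per_run, 2)


print(runs_per_dollar())     # 120
print(estimated_cost(1000))  # 8.3
```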

This model runs on Nvidia A100 (80GB) GPU hardware. Predictions typically complete within 6 seconds. The predict time for this model varies significantly based on the inputs.

Readme

Model by Lyumin Zhang

Usage

To start, draw a picture of what you'd like to generate an image of. There's a nice Replicate-powered UI for drawing at https://scribblediffusion.com/. You can also upload an existing image, either through the UI on this page or with the Replicate Python API.

Then, prompt the model to generate an image as you would for Stable Diffusion. The model uses your drawing as a template to guide image generation, so the resulting image should follow the layout of your drawing.
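The upload-and-prompt steps above can be sketched with the Replicate Python client. This is a rough illustration under stated assumptions, not the model's documented schema: apart from `image`, the input key names (`prompt`, `ddim_steps`, `scale`, `seed`) are assumptions, as are the helper names `build_inputs` and `generate`. Check the model's API tab for the exact field names and for the version identifier to use in `model_ref`.

```python
def build_inputs(image, prompt, steps=20, scale=9.0, seed=None):
    """Assemble an input payload, enforcing the guidance scale range listed under Input."""
    if not 0.1 <= scale <= 30:
        raise ValueError("guidance scale must be between 0.1 and 30")
    # Key names other than "image" are assumptions for illustration.
    inputs = {"image": image, "prompt": prompt, "ddim_steps": steps, "scale": scale}
    if seed is not None:
        inputs["seed"] = seed  # a fixed seed makes runs easier to compare
    return inputs


def generate(model_ref, scribble_path, prompt, **options):
    """Send a scribble and a prompt to the hosted model and return its output.

    model_ref is the model reference string copied from the model page,
    for example "jagilley/controlnet-scribble:<version>".
    Requires the third-party `replicate` package and the REPLICATE_API_TOKEN
    environment variable.
    """
    import replicate  # imported lazily so build_inputs works without it

    with open(scribble_path, "rb") as image_file:
        return replicate.run(model_ref, input=build_inputs(image_file, prompt, **options))
```

Keeping the payload construction separate from the network call makes it possible to validate settings locally before spending a run.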

Model description

This model is a ControlNet that adapts Stable Diffusion to use a line drawing (or “scribble”) in addition to a text prompt when generating an output image.

ControlNet is a neural network structure that controls pretrained large diffusion models so that they support additional input conditions beyond text prompts. The ControlNet learns task-specific conditions in an end-to-end way, and the learning is robust even when the training dataset is small (< 50k samples). Moreover, training a ControlNet is as fast as fine-tuning a diffusion model, and the model can be trained on a personal device. Alternatively, if powerful computation clusters are available, the model can scale to large amounts of training data (millions to billions of samples). Large diffusion models like Stable Diffusion can be augmented with ControlNets to enable conditional inputs like edge maps, segmentation maps, keypoints, etc.
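One way to picture how a ControlNet augments a pretrained model is the structure described in the cited paper: each pretrained block is kept frozen, a trainable copy of it receives the extra condition, and the copy is connected through layers whose weights start at zero. The sketch below is a heavily simplified illustration of that idea, assuming small dense layers in place of the convolutions used in practice; every name in it is hypothetical and it is not the authors' implementation.

```python
import math
import random


def linear(weights, bias, x):
    """Apply y = W x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def block(x, weights, bias):
    """Stand-in for one block of the diffusion model."""
    return [math.tanh(v) for v in linear(weights, bias, x)]


def controlled_block(x, condition, p):
    """Frozen block plus a trainable copy connected through zero-initialized layers."""
    base = block(x, p["frozen_w"], p["frozen_b"])
    # Inject the condition (e.g. scribble features) through the first zero layer.
    injected = [a + b for a, b in zip(x, linear(p["zero_in_w"], p["zero_in_b"], condition))]
    copy_out = block(injected, p["copy_w"], p["copy_b"])
    # The second zero layer controls how much of the copy's output is added back.
    residual = linear(p["zero_out_w"], p["zero_out_b"], copy_out)
    return [a + b for a, b in zip(base, residual)]


def init_params(dim, seed=0):
    """Random 'pretrained' weights, an identical trainable copy, and zero layers."""
    rng = random.Random(seed)
    frozen_w = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]
    frozen_b = [rng.uniform(-1, 1) for _ in range(dim)]
    return {
        "frozen_w": frozen_w, "frozen_b": frozen_b,
        "copy_w": [row[:] for row in frozen_w], "copy_b": frozen_b[:],
        "zero_in_w": [[0.0] * dim for _ in range(dim)], "zero_in_b": [0.0] * dim,
        "zero_out_w": [[0.0] * dim for _ in range(dim)], "zero_out_b": [0.0] * dim,
    }


params = init_params(4)
x = [0.2, -0.5, 0.9, 0.1]
scribble_features = [1.0, 0.0, 1.0, 0.0]
# Before any training the zero layers contribute nothing, so the output is unchanged.
print(controlled_block(x, scribble_features, params) == block(x, params["frozen_w"], params["frozen_b"]))
```

Because the added path outputs zero at initialization, training starts from the unmodified pretrained behavior and only gradually learns to use the condition.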

Original model & code on GitHub

Other ControlNets

There are many different ways to use a ControlNet to modify the output of Stable Diffusion. Here are a few different options, all of which use an input image in addition to a prompt to generate an output. The methods process the input in different ways; try them out to see which works best for a given application.

ControlNet for generating images from drawings

  • Scribble: https://replicate.com/jagilley/controlnet-scribble

ControlNets for generating humans based on an input image

  • Human Pose Detection: https://replicate.com/jagilley/controlnet-pose

ControlNets for preserving general qualities of an input image

  • Edge detection: https://replicate.com/jagilley/controlnet-canny
  • HED maps: https://replicate.com/jagilley/controlnet-hed
  • Depth map: https://replicate.com/jagilley/controlnet-depth2img
  • Hough line detection: https://replicate.com/jagilley/controlnet-hough
  • Normal map: https://replicate.com/jagilley/controlnet-normal

Citation

@misc{https://doi.org/10.48550/arxiv.2302.05543,
  doi = {10.48550/ARXIV.2302.05543},
  url = {https://arxiv.org/abs/2302.05543},
  author = {Zhang, Lvmin and Agrawala, Maneesh},
  keywords = {Computer Vision and Pattern Recognition (cs.CV), Artificial Intelligence (cs.AI), Graphics (cs.GR), Human-Computer Interaction (cs.HC), Multimedia (cs.MM), FOS: Computer and information sciences},
  title = {Adding Conditional Control to Text-to-Image Diffusion Models},
  publisher = {arXiv},
  year = {2023},
  copyright = {arXiv.org perpetual, non-exclusive license}
}