Using ControlNet with Stable Diffusion

ControlNet is a method for conforming your image generations to a particular structure. It’s easiest to explain with an example.

Let’s say we want to create an image with a particular pose: a photo like the one of the man below, but with a boy striking the pose instead.

a man in a suit

This is a really hard problem with default Stable Diffusion. We could try a prompt that describes our desired pose in detail, like this:

a photo of a boy wearing a blue jacket looking to his right with his left arm slightly above his right

But this probably isn’t going to work. Our outputs aren’t going to conform to all of these instructions, and even if we got lucky and they did, we wouldn’t be able to generate this pose consistently. Text prompts give Stable Diffusion only coarse control over composition: they’re good at describing what should be in the image, but bad at pinning down exactly where limbs should go, so precise spatial instructions are mostly ignored and vary from run to run.

Enter ControlNet. ControlNet models accept two inputs:

A text prompt: a boy wearing a blue jacket

And a conditioning image. There are lots of types, but for now let’s use a stick figure (often called human pose):

stick figure

The model then uses our pose to control the structure of the generated image:

boy

Let’s try changing our prompt, but keeping the pose input the same:

“chef in kitchen”

chef

“Lincoln statue”

lincoln

Source: ControlNet paper

And voila! We can create all kinds of images that are guided by the stick figure pose.

That’s the essence of ControlNet. We always provide two inputs: a text prompt (just like normal Stable Diffusion) and a conditioning image. The output is guided by our conditioning image.

diagram
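To make that concrete, here’s a minimal sketch of the two-input flow using Hugging Face’s diffusers library. The checkpoint names and the local pose.png file are my own choices for illustration, not something prescribed by ControlNet itself.

```python
import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from diffusers.utils import load_image

# Input 1: the conditioning image (a stick-figure pose, assumed saved locally)
pose_image = load_image("pose.png")

# Load a pose-conditioned ControlNet and attach it to a Stable Diffusion pipeline
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# Input 2: the text prompt. The output follows the prompt *and* the pose.
image = pipe("a boy wearing a blue jacket", image=pose_image).images[0]
image.save("boy_in_blue_jacket.png")
```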

Importantly, the conditioning image isn’t restricted to stick figure poses. We can provide all kinds of custom compositions for the ControlNet model to follow: edges, depth maps, scribbles, segmentation maps, and many other structures. More on that later.

ControlNet was created by Stanford researchers and announced in the paper Adding Conditional Control to Text-to-Image Diffusion Models.

It’s trained on top of Stable Diffusion, so the flexibility and aesthetic of Stable Diffusion are still there.

I don’t want to go too in depth, but training ControlNet involves using conditioning images to teach the model how to generate images that follow a specific condition. These conditioning images act like guidelines or templates. For example, to teach the model to follow edge maps, each training image is paired with its own edge map (produced by an edge detector) and a text prompt describing the scene. This combination of visual and textual input helps the model understand not just what to draw (from the text) but also how to draw it (from the conditioning image).
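If you’re curious what that looks like in code, here’s a rough, simplified sketch of the paper’s core trick: the original Stable Diffusion weights are frozen, a trainable copy of each encoder block learns from the conditioning image, and the two are joined by “zero convolutions” (1×1 convolutions initialized to zero) so training starts out as a no-op. This is my own illustrative PyTorch sketch, not the actual implementation; the real model wires the control branch into the U-Net’s skip connections.

```python
import copy
import torch.nn as nn

def zero_conv(channels):
    # 1x1 convolution with weights and bias initialized to zero.
    # At the start of training it outputs zeros, so the frozen base
    # model's behavior is untouched; gradients gradually turn it on.
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

class ControlledBlock(nn.Module):
    """A frozen Stable Diffusion block plus a trainable copy, joined by zero convs."""
    def __init__(self, block, channels):
        super().__init__()
        self.trainable = copy.deepcopy(block)    # copy that learns to use the condition
        self.frozen = block                      # original weights, locked
        for p in self.frozen.parameters():
            p.requires_grad = False
        self.zero_in = zero_conv(channels)
        self.zero_out = zero_conv(channels)

    def forward(self, x, condition):
        # `condition` is assumed to already be encoded to the same shape as x.
        base = self.frozen(x)
        control = self.trainable(x + self.zero_in(condition))
        return base + self.zero_out(control)     # at initialization this is exactly `base`
```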

To learn more, check out the paper. Or do what I did, and upload the paper to ChatGPT and ask it a bunch of questions.

There are many types of conditioning images we can use in ControlNet. We’re not restricted to our human pose. We can guide our images using edge detectors, depth maps, segmentation maps, sketches, and more.

types

The best way to see how each of these work is by example.

Let’s run a selection of conditioners on the following two images with completely new prompts.

Cyberpunk Couple

New prompt:

couple embracing, summer clothes,
color photo, 50mm prime, canon, studio
portrait

couple

Tea Room

New prompt:

a photo of a beautiful cyberpunk
cafe, dawn light, warm tones,
morning scene, windows

tea

Let’s see what happens when we apply different ControlNets on these two images. In other words, let’s go through this process for each type of conditioner:

process

Canny is a widely used edge detector.

Here’s our tea room after going through the Canny pre-processor:

tea1

canny

Remember our prompt from earlier?

"a photo of a beautiful cyberpunk
cafe, dawn light, warm tones,
morning scene, windows"

Now, when we generate an image with our new prompt, ControlNet will produce an image based on the prompt, but guided by the Canny edge map:

result

Here’s that same process applied to our image of the couple, with our new prompt:

couple1
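If you want to build a Canny conditioning image yourself, the pre-processing step is just a standard edge detector. Here’s a small sketch with OpenCV; the filenames and thresholds are mine, and the thresholds are worth tuning per image.

```python
import cv2
import numpy as np
from PIL import Image

# Detect edges in the source photo (file name is just an example).
image = cv2.imread("tea_room.png")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray, 100, 200)  # low/high thresholds control edge sensitivity

# ControlNet expects a 3-channel image, so stack the edge map into RGB.
conditioning = Image.fromarray(np.stack([edges] * 3, axis=-1))
conditioning.save("tea_room_canny.png")
```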

HED is another kind of edge detector. Here’s our pre-processed output:

tea2

hed

And here’s our output:

output

The effect of HED is a bit more clear on our cyberpunk couple:

couple2

Note that many details that were lost in the Canny version are present in HED detection, like her hand placement, the angle of the overhead light, and her headpiece. Note also where HED struggles—his collar becomes a streak of hair, and her makeup becomes a shadow. HED works really well for painting and art.
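HED is a learned edge detector, so instead of OpenCV you need its pre-trained weights. A convenient option is the community controlnet_aux package (my choice here, not something this post depends on), which wraps HED and most of the other annotators below:

```python
from controlnet_aux import HEDdetector
from diffusers.utils import load_image

# Download the pre-trained HED annotator weights from the Hugging Face Hub.
hed = HEDdetector.from_pretrained("lllyasviel/Annotators")

image = load_image("cyberpunk_couple.png")   # example filename
hed_map = hed(image)                         # returns a PIL image of soft edges
hed_map.save("couple_hed.png")

# MLSDdetector, MidasDetector (depth) and NormalBaeDetector follow the same pattern.
```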

M-LSD is another kind of edge detector; it only picks up straight lines. Here’s M-LSD on our tea room:

tea3

mlsd

And on our cyberpunk couple:

couple3

As you can tell, M-LSD doesn’t work well here—there aren’t many straight lines in the original.

A depth map estimates how far each part of the image is from the camera. Here’s what it looks like:

tea4

depth

result2

This works really well on our tea room. The frames, plants, tables and pillows are all preserved. Let’s try it on our couple:

couple4

Again, the composition is preserved, but it struggles with the collar and hands.

This one should be familiar to you! Open pose (also called human pose) turns images of people into stick figures. Let’s try it on our couple:

couple5

The colored bars are representations of our characters’ body parts. You can see that the poses are preserved, but the detail and style are lost.
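The stick-figure conditioning image itself comes from a pose estimator. Assuming the same controlnet_aux package as above, the detection step looks roughly like this; its output is the kind of pose.png we fed to the pipeline sketch earlier.

```python
from controlnet_aux import OpenposeDetector
from diffusers.utils import load_image

# Pose estimator that turns a photo of people into a stick-figure image.
openpose = OpenposeDetector.from_pretrained("lllyasviel/Annotators")

photo = load_image("cyberpunk_couple.png")   # example filename
pose = openpose(photo)                       # PIL image of colored limbs and joints
pose.save("pose.png")
```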

Scribble is behind the very popular “turn a sketch into a drawing” apps. I’m going to use a different example for scribble, because it works best with a doodle as the conditioning image. For example, here’s a doodle of a cat alongside my text prompt, “an oil painting of a cat”:

cat

Here’s the same with an owl:

owl

Segmentation breaks our image down into regions, labeling each one by what it contains.

tea5

segments

result3

We’ve maintained the plants, the frames, and the table, but the output is quite different. There’s a bit of a fisheye effect, and we’ve lost the pillars in the cafe. Here’s segmentation on our couple:

couple6

A normal map captures the orientation of each surface in the image, which gives a sense of its texture and geometry.

tea6

normal

result4

This is a nice output, but it doesn’t preserve our original input at all. A normal map works much better on our couple:

couple7

Using ControlNet is easy with Replicate 😎. We have a collection of ControlNet models here.

collection

You can get started by choosing a ControlNet model and playing around with it in our GUI. If you’re a developer and want to integrate ControlNet into your app, click the API tab and you’ll be able to copy and paste the API request into your codebase. Happy hacking!

gui
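For reference, here’s roughly what a request looks like with Replicate’s Python client. The model slug, version placeholder, and input keys below are illustrative; copy the exact request from the API tab of whichever model you pick.

```python
import replicate  # requires REPLICATE_API_TOKEN in your environment

# Illustrative model reference and inputs: grab the real ones from the model's API tab.
output = replicate.run(
    "<model-owner>/<controlnet-model>:<version-id>",
    input={
        "prompt": "a boy wearing a blue jacket",
        "image": open("pose.png", "rb"),   # the conditioning image
    },
)
print(output)
```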