Fine-tuning Stable Diffusion

You can fine-tune Stable Diffusion on your own images to create a new version of the model that is better at generating images of a person, object, or style.

There are multiple ways to fine-tune Stable Diffusion, such as:

  • Dreambooth
  • LoRAs (Low-Rank Adaptation)
  • Textual inversion

Each of these techniques needs just a few images of the subject or style you are training on. You can use the same images for all of them; 5 to 10 images is usually enough.

Dreambooth is a full model fine-tune that produces checkpoints that can be used as independent models. These checkpoints are typically 2GB or larger.

Read our blog post for a guide on using Replicate for training Stable Diffusion with Dreambooth.

Google Research announced Dreambooth in August 2022 (read the paper).

LoRAs are faster to tune and more lightweight than Dreambooth. As small as 1-6MB, they are easy to share and download.

LoRA is a general technique that accelerates fine-tuning by training small low-rank matrices rather than the full model weights. These matrices are then loaded into an unchanged base model to apply their effect. In February 2023, Simo Ryu published a way to fine-tune diffusion models like Stable Diffusion using LoRAs.
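
To make the idea concrete, here is a minimal, illustrative sketch (not Replicate’s implementation) of a low-rank update applied to a single frozen weight matrix, using PyTorch with made-up dimensions:

import torch

d, r = 768, 4                  # hypothetical layer width and LoRA rank
W = torch.randn(d, d)          # frozen base model weight, left untouched
A = torch.randn(r, d) * 0.01   # small trainable factor
B = torch.zeros(d, r)          # zero-initialized, so the update starts at zero
lora_scale = 1.0               # strength of the adaptation

# At load time, the low-rank product is added on top of the base weight.
W_effective = W + lora_scale * (B @ A)

Only A and B are trained, which is why the resulting files are so small.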

You can also use a lora_scale to change the strength of a LoRA.
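
For example, when running a fine-tuned SDXL model, lora_scale can be passed alongside the prompt. This is a sketch: the model name and version are placeholders for your own fine-tune, and it assumes the model exposes lora_scale as a prediction input.

import replicate

output = replicate.run(
    "yourname/model:abcde1234...",
    input={
        "prompt": "a photo of your subject",
        "lora_scale": 0.8,  # lower values weaken the LoRA, higher values strengthen it
    },
)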

Textual inversion does not modify any model weights. Instead, it works in the text embedding space: it learns a new token for a concept (tied to a given word) that can then be used with the existing Stable Diffusion weights (read the paper).
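
As a rough conceptual sketch (again, not Replicate’s implementation), textual inversion freezes everything except the embedding vector of the new token:

import torch

vocab_size, emb_dim = 49409, 768          # hypothetical CLIP-like sizes
embeddings = torch.nn.Embedding(vocab_size, emb_dim)
embeddings.weight.requires_grad_(False)   # the existing vocabulary stays frozen

new_token_id = vocab_size - 1             # slot reserved for the new concept
new_vector = embeddings.weight[new_token_id].clone().requires_grad_(True)
optimizer = torch.optim.Adam([new_vector], lr=5e-4)
# A training loop would then run the usual denoising objective on prompts
# containing the new token, updating only new_vector.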

You can fine-tune SDXL using Replicate – these fine-tunes combine LoRAs and textual inversion to create high quality results.

There are two ways to fine-tune SDXL on Replicate:

  1. Use the Replicate website to start a training; it lets you adjust the most important training parameters
  2. Use the Replicate API to train with the full range of training parameters

Whichever approach you choose, you’ll need to prepare your training data.

When running a training you need to provide a zip file containing your training images. Keep the following guidelines in mind when preparing them.

Images should contain only the subject itself, without background noise or other objects. They need to be in JPEG or PNG format. Dimensions, file size and filenames don’t matter.

You can use as few as 5 images, but 10-20 images is better. The more images you use, the better the fine-tune will be. Small images will be automatically upscaled. All images will be cropped to square during the training process.

Put your images in a folder and zip it up. The directory structure of the zip file doesn’t matter:

zip -r data.zip data

🍿 Watch the fine-tuning guide on YouTube

This guide uses the Replicate CLI to start the training, but if you want to use something else you can use a client library or call the HTTP API directly.
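
If you prefer the client-library route, a training can be created with the Python client like this. The SDXL version hash and the zip URL below are placeholders you would fill in yourself:

import replicate

training = replicate.trainings.create(
    version="stability-ai/sdxl:<version-hash>",  # copy the current version from the model page
    input={"input_images": "https://example.com/data.zip"},
    destination="yourname/model",
)
print(training.status)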

Let’s start by installing the Replicate CLI:

brew tap replicate/tap
brew install replicate

Grab your API token from replicate.com/account and set the REPLICATE_API_TOKEN environment variable.

export REPLICATE_API_TOKEN=...

You need to create a model on Replicate – it will be the destination for your trained SDXL version:

# You can also go to https://replicate.com/create
replicate model create yourname/model --hardware gpu-a40-small

Now you can start your training:

replicate train stability-ai/sdxl \
  --destination yourname/model \
  --web \
  input_images=@data.zip

The input_images parameter is required. You can pass in a URL to your uploaded zip file, or use the @ prefix to upload one from your local filesystem.
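
For example, if your zip file is already hosted somewhere, you can pass its address instead (the URL below is a placeholder):

replicate train stability-ai/sdxl \
  --destination yourname/model \
  --web \
  input_images=https://example.com/data.zip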

See the training inputs on the SDXL model for a full list of training options.

Visit replicate.com/trainings to follow the progress of your training job.

When the model has finished training you can run it using replicate.com/my-name/my-model, or via the API:

import replicate

output = replicate.run(
    "my-name/my-model:abcde1234...",
    input={"prompt": "a photo of TOK riding a rainbow unicorn"},
)

The trained concept is named TOK by default, but you can change that by setting token_string and caption_prefix inputs during the training process.
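
For example, the training command shown earlier could be extended like this (MYTOK and the prefix are just illustrative values):

replicate train stability-ai/sdxl \
  --destination yourname/model \
  --web \
  input_images=@data.zip \
  token_string=MYTOK \
  caption_prefix="a photo of MYTOK, "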

Before fine-tuning starts, the input images are preprocessed using multiple models:

  • SwinIR upscales the input images to a higher resolution.
  • BLIP generates text captions for each input image.
  • CLIPSeg removes regions of the images that are not interesting or helpful for training.

For most users, the captions that BLIP generates work well. However, you can provide your own captions by adding a caption.csv file to your zip file of input images. Each image needs a caption.
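
As a rough illustration only (the column names here are an assumption, so check the SDXL training documentation for the exact expected header), a caption.csv pairs each image filename with its caption:

caption,image_file
"a photo of TOK wearing a red jacket",img_0.jpg
"a photo of TOK standing on a beach",img_1.jpg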