You can fine-tune Stable Diffusion on your own images to create a new version of the model that is better at generating images of a person, object, or style.
For example, we’ve fine-tuned SDXL on particular people, objects, and styles.
There are multiple ways to fine-tune Stable Diffusion, such as:

- Dreambooth
- LoRA
- Textual inversion
Each of these techniques needs just a few images of the subject or style you are training, and you can use the same images for all of them. 5 to 10 images is usually enough.
Dreambooth is a full model fine-tune that produces checkpoints that can be used as independent models. These checkpoints are typically 2GB or larger.
Read our blog post for a guide on using Replicate for training Stable Diffusion with Dreambooth.
Google Research announced Dreambooth in August 2022 (read the paper).
LoRAs are faster to tune and more lightweight than Dreambooth. As small as 1-6MB, they are easy to share and download.
LoRAs are a general technique that accelerates fine-tuning by training small low-rank matrices instead of the full model. These then get loaded into an unchanged base model to apply their effect. In February 2023 Simo Ryu published a way to fine-tune diffusion models like Stable Diffusion using LoRAs.
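To make that concrete, here is a minimal PyTorch-style sketch of the idea (not the implementation any particular LoRA library uses): a frozen linear layer gets a trainable low-rank update added to its output.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update."""

    def __init__(self, base: nn.Linear, rank: int = 4, scale: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the base model stays unchanged
        # Two small matrices whose product has shape (out_features, in_features)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = scale  # plays the same role as a lora_scale at inference

    def forward(self, x):
        # Output of the frozen weights, plus the scaled low-rank delta.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```

Because only the small matrices are trained and saved, a LoRA checkpoint can be just a few megabytes.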
You can also use a `lora_scale` to change the strength of a LoRA.
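For example, here's a rough sketch using the Replicate Python client library, where the model name and version hash are placeholders for one of your own LoRA fine-tunes:

```python
import replicate

# "my-name/my-model:<version-id>" is a placeholder; use your own fine-tune.
output = replicate.run(
    "my-name/my-model:<version-id>",
    input={
        "prompt": "a portrait photo in the trained style",
        "lora_scale": 0.6,  # lower values weaken the LoRA's effect
    },
)
print(output)
```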
Textual inversion does not modify any model weights. Instead, it works in the text embedding space: it trains a new token embedding for a concept (tied to a word you choose) that can be used with the existing Stable Diffusion weights (read the paper).
You can fine-tune SDXL using Replicate – these fine-tunes combine LoRAs and textual inversion to create high quality results.
There are two ways to fine-tune SDXL on Replicate.
Whichever approach you choose, you’ll need to prepare your training data.
When running a training you need to provide a zip file containing your training images. Keep the following guidelines in mind when preparing your training images.
- Images should contain only the subject itself, without background noise or other objects.
- Images need to be in JPEG or PNG format; dimensions, file size, and filenames don't matter.
- You can use as few as 5 images, but 10-20 is better. The more images you use, the better the fine-tune will be.
- Small images will be automatically upscaled, and all images will be cropped to square during the training process.
Put your images in a folder and zip it up. The directory structure of the zip file doesn't matter.
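If you want to script this step, here's a small Python sketch that bundles an image folder into a zip; the folder and file names are placeholders:

```python
import zipfile
from pathlib import Path

# Bundle every JPEG/PNG in ./training-images into data.zip.
image_dir = Path("training-images")
with zipfile.ZipFile("data.zip", "w") as zf:
    for path in sorted(image_dir.iterdir()):
        if path.suffix.lower() in {".jpg", ".jpeg", ".png"}:
            zf.write(path, arcname=path.name)
```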
This guide uses the Replicate CLI to start the training, but if you want to use something else you can use a client library or call the HTTP API directly.
Let’s start by installing the Replicate CLI.
Grab your API token from replicate.com/account and set the `REPLICATE_API_TOKEN` environment variable.
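The Python client library used in the sketches below reads the token from that environment variable, so you can also set it in code; the token value here is a placeholder:

```python
import os

# Placeholder token; use the real value from replicate.com/account.
os.environ["REPLICATE_API_TOKEN"] = "r8_..."
```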
You need to create a model on Replicate – it will be the destination for your trained SDXL version.
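If you're using the Python client library instead of the CLI, a minimal sketch of this step might look like the following (the owner, name, visibility, and hardware values are placeholders to adjust):

```python
import replicate

# Create an empty model to receive the fine-tuned weights.
# All values are placeholders; check the available hardware options on Replicate.
model = replicate.models.create(
    owner="my-name",
    name="my-model",
    visibility="private",
    hardware="gpu-t4",
)
print(f"{model.owner}/{model.name}")
```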
Now you can start your training (see the sketch after the notes below).
The `input_images` parameter is required. You can pass in a URL to your uploaded zip file, or use the `@` prefix to upload one from your local filesystem.
See the training inputs on the SDXL model for a full list of training options.
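As a sketch of starting the training with the Python client library, using the model created above as the destination (the SDXL version hash and the zip URL are placeholders):

```python
import replicate

training = replicate.trainings.create(
    # Base model to fine-tune; copy the current version hash from the
    # stability-ai/sdxl model page. "<version-id>" is a placeholder.
    version="stability-ai/sdxl:<version-id>",
    input={
        "input_images": "https://example.com/my-training-images.zip",
    },
    # The model you created earlier receives the trained weights.
    destination="my-name/my-model",
)
print(training.id, training.status)
```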
Visit replicate.com/trainings to follow the progress of your training job.
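You can also check on a training from code; this sketch uses the Python client library and a placeholder training id:

```python
import replicate

# "your-training-id" is a placeholder; use the id returned by
# replicate.trainings.create() or shown on replicate.com/trainings.
training = replicate.trainings.get("your-training-id")
print(training.status)  # e.g. "starting", "processing", "succeeded", "failed"
```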
When the model has finished training you can run it using replicate.com/my-name/my-model, or via the API.
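Here's a rough sketch using the Python client library; the version hash is a placeholder you'd copy from your model page:

```python
import replicate

# Generate an image with the fine-tuned model. The prompt uses the
# default concept token, TOK (see below).
output = replicate.run(
    "my-name/my-model:<version-id>",
    input={"prompt": "a photo of TOK riding a bicycle"},
)
print(output)
```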
The trained concept is named `TOK` by default, but you can change that by setting the `token_string` and `caption_prefix` inputs during the training process.
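For example, extending the training sketch from earlier (the token and prefix values are illustrative; check the SDXL training inputs for the exact semantics of these fields):

```python
import replicate

training = replicate.trainings.create(
    version="stability-ai/sdxl:<version-id>",  # placeholder version hash
    input={
        "input_images": "https://example.com/my-training-images.zip",
        "token_string": "ZXC",                 # illustrative replacement for TOK
        "caption_prefix": "a photo of ZXC, ",  # assumed: prefix used when captioning
    },
    destination="my-name/my-model",
)
```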
Before fine-tuning starts, the input images are preprocessed by multiple models, which automatically upscale small images, crop them to square, and generate captions.
For most users, the captions that BLIP generates for training work well. However, you can provide your own captions by adding a `caption.csv` file to your zip file of input images. Each image needs a caption. Here's an example csv.
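As a rough sketch of generating such a file with Python (the column names here are assumptions, so check them against the example csv referenced above):

```python
import csv

# Sketch: write a caption.csv to add to your training zip. The column
# names ("image_file", "caption") are assumptions, not confirmed by the trainer.
rows = [
    ("photo-1.jpg", "a photo of TOK smiling"),
    ("photo-2.jpg", "a photo of TOK wearing sunglasses"),
]
with open("caption.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["image_file", "caption"])
    writer.writerows(rows)
```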