stability-ai / sdxl

A text-to-image generative AI model that creates beautiful images

  • Public
  • 77.5M runs
  • L40S

Train stability-ai/sdxl

You can train SDXL on a particular object or style, and create a new model that generates images of those objects or styles. Training only requires a few images, and takes about 10-15 minutes. You can also download your fine-tuned LoRA weights to use elsewhere.

Trainings for this model run on Nvidia L40S GPU hardware, which costs $0.000975 per second.
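As a rough sanity check on the numbers above, the per-second rate and the quoted 10-15 minute duration give an estimated cost of roughly $0.59 to $0.88 per training. The sketch below just does that arithmetic; the function name is made up for illustration, and actual billing depends on how long your training really runs.

```python
# Per-second price for Nvidia L40S hardware, as quoted on this page.
PRICE_PER_SECOND_USD = 0.000975


def estimate_training_cost(minutes: float,
                           price_per_second: float = PRICE_PER_SECOND_USD) -> float:
    """Return the estimated cost in USD of a training that runs for `minutes`."""
    return minutes * 60 * price_per_second


low = estimate_training_cost(10)   # lower end of the quoted duration
high = estimate_training_cost(15)  # upper end of the quoted duration
print(f"Estimated cost: ${low:.4f} to ${high:.4f}")
```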


If you haven’t yet trained a model on Replicate, we recommend reading one of Replicate's fine-tuning guides first.

Create training

The easiest way to train SDXL is to use the form below. Upon creation, you will be redirected to the training detail page where you can monitor your training's progress, and eventually download the weights and run the trained model.

destination (string, required)

Select a model on Replicate that will be the destination for the trained version. If the model does not exist, select the "Create model" option and a field will appear to enter the name of the new model. We'll create the model for you when you create the training.

input_images (file, required)

A zip file containing your training images. A handful of images (5-6) is enough to fine-tune SDXL on a single person, but you might need more if your training subject is more complex or the images are very different.

Default: ""

seed (integer)

Random seed integer for reproducible training. Leave empty to use a random seed.

token_string (string)

A unique string that will be trained to refer to the concept in the input images. Can be anything, but TOK works well.

Default: "TOK"

caption_prefix (string)

Text that will be used as a prefix during automatic captioning. Must contain the token_string. For example, if the caption text is ‘a photo of TOK’, automatic captioning will expand it to ‘a photo of TOK under a bridge’, ‘a photo of TOK holding a cup’, etc.

Default: ""

max_train_steps (number)

Number of individual training steps. Takes precedence over num_train_epochs.

Default: 1000

use_face_detection_instead (boolean)

Use face detection instead of CLIPSeg for masking. For face applications, we recommend enabling this option.

Default: false
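The form is not the only route: the same training can be started from code. The sketch below packs a folder of images into the zip archive the form asks for, then creates a training with `replicate.trainings.create` from the Replicate Python client. Several things here are assumptions rather than facts from this page: the helper names, the set of image suffixes, the idea that you host the archive at a URL you control, and the version and destination strings, which are placeholders you would replace with your own.

```python
import zipfile
from pathlib import Path

# Assumed set of image file types; adjust to whatever your dataset contains.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}


def zip_images(image_dir, zip_path):
    """Pack every image file in image_dir into a flat zip; return the file count."""
    files = sorted(
        p for p in Path(image_dir).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in files:
            # arcname keeps the archive flat (no parent directories inside the zip).
            archive.write(path, arcname=path.name)
    return len(files)


def start_training(version, destination, input_images_url, **extra_inputs):
    """Create a training via the Replicate client (requires REPLICATE_API_TOKEN).

    `version` looks like "stability-ai/sdxl:<version-id>" and `destination`
    like "your-username/your-model"; both are placeholders in this sketch.
    """
    import replicate  # imported lazily so zip_images works without the client

    return replicate.trainings.create(
        version=version,
        destination=destination,
        input={"input_images": input_images_url, **extra_inputs},
    )
```

A call might look like `start_training("stability-ai/sdxl:<version-id>", "your-username/your-model", "https://example.com/images.zip", token_string="TOK")`, where every argument value is a stand-in.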


Before fine-tuning starts, the input images are preprocessed using SwinIR for upscaling, BLIP for captioning, and CLIPSeg for removing regions of the images that are not interesting or helpful for training.

Below is a list of all fine-tuning parameters.

Training inputs

  • input_images (required): A .zip or .tar file containing the image files that will be used for fine-tuning.
  • seed: Random seed integer for reproducible training. Leave empty to use a random seed.
  • resolution: Square pixel resolution which your images will be resized to for training. Defaults to 512.
  • train_batch_size: Batch size (per device) for training. Defaults to 4.
  • num_train_epochs: Number of epochs to loop through your training dataset. Defaults to 4000.
  • max_train_steps: Number of individual training steps. Takes precedence over num_train_epochs. Defaults to 1000.
  • is_lora: Boolean indicating whether to use LoRA training. If set to False, full fine-tuning is used instead. Defaults to True.
  • unet_learning_rate: Learning rate for the U-Net as a float. We recommend a value between 1e-6 and 1e-5. Defaults to 1e-6.
  • ti_lr: Scaling of learning rate for training textual inversion embeddings. Don’t alter unless you know what you’re doing. Defaults to 3e-4.
  • lora_lr: Scaling of learning rate for training LoRA embeddings. Don’t alter unless you know what you’re doing. Defaults to 1e-4.
  • lr_scheduler: Learning rate scheduler to use for training. Allowable values are constant or linear. Defaults to constant.
  • lr_warmup_steps: Number of warmup steps for lr schedulers with warmups. Defaults to 100.
  • token_string: A unique string that will be trained to refer to the concept in the input images. Can be anything, but TOK works well. Defaults to TOK.
  • caption_prefix: Text that will be used as a prefix during automatic captioning. Must contain the token_string. For example, if the caption text is ‘a photo of TOK’, automatic captioning will expand it to ‘a photo of TOK under a bridge’, ‘a photo of TOK holding a cup’, etc. Defaults to a photo of TOK.
  • mask_target_prompts: Prompt that describes the part of the image that you consider important. For example, if you are fine-tuning on photos of your pet, photo of a dog is a good prompt. Prompt-based masking is used to focus the fine-tuning process on the important/salient parts of the image. Defaults to None.
  • crop_based_on_salience: To crop the image to target_size based on the important parts of the image, set this to True. To crop the image based on face detection instead, set this to False. Defaults to True.
  • use_face_detection_instead: Use face detection instead of CLIPSeg for masking. For face applications, we recommend using this option. Defaults to False.
  • clipseg_temperature: How blurry you want the CLIPSeg mask to be. We recommend a value between 0.5 and 1.0. For a sharper (but more error-prone) mask, decrease this value. Defaults to 1.0.
  • verbose: Verbose output. Defaults to True.
  • checkpointing_steps: Number of steps between saving checkpoints. Set this to a very high number to disable checkpointing if you don’t need checkpoints. Defaults to 200.
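
To make the list above easier to work with from code, the defaults can be collected into a dictionary and merged with your own overrides before a training is submitted. The sketch below does that and checks the two constraints the list states explicitly: lr_scheduler must be constant or linear, and caption_prefix must contain the token_string. The `resolve_inputs` helper is hypothetical, not part of any Replicate library, and the real trainer may validate inputs differently.

```python
# Defaults copied from the parameter list above.
DEFAULTS = {
    "resolution": 512,
    "train_batch_size": 4,
    "num_train_epochs": 4000,
    "max_train_steps": 1000,
    "is_lora": True,
    "unet_learning_rate": 1e-6,
    "ti_lr": 3e-4,
    "lora_lr": 1e-4,
    "lr_scheduler": "constant",
    "lr_warmup_steps": 100,
    "token_string": "TOK",
    "caption_prefix": "a photo of TOK",
    "mask_target_prompts": None,
    "crop_based_on_salience": True,
    "use_face_detection_instead": False,
    "clipseg_temperature": 1.0,
    "verbose": True,
    "checkpointing_steps": 200,
}


def resolve_inputs(input_images, **overrides):
    """Merge overrides into the documented defaults and run basic checks.

    Hypothetical helper: it only enforces rules stated in the list above.
    """
    # "seed" has no default (empty means random), so it is allowed but optional.
    unknown = set(overrides) - set(DEFAULTS) - {"seed"}
    if unknown:
        raise ValueError(f"Unknown training inputs: {sorted(unknown)}")

    inputs = {**DEFAULTS, **overrides, "input_images": input_images}

    if inputs["lr_scheduler"] not in ("constant", "linear"):
        raise ValueError("lr_scheduler must be 'constant' or 'linear'")
    if inputs["token_string"] not in inputs["caption_prefix"]:
        raise ValueError("caption_prefix must contain the token_string")
    return inputs


# Example: a face fine-tune with a fixed seed; the URL is a placeholder.
inputs = resolve_inputs(
    "https://example.com/images.zip",
    use_face_detection_instead=True,
    seed=42,
)
print(inputs["max_train_steps"], inputs["use_face_detection_instead"])
```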