To generates images, enter a
prompt and run the model.
SDXL supports in-painting, which lets you “fill in” parts of an existing image with generated content.
- Enter a
promptfor the in-painted pixels
- Select an input image in the
- In the
maskfield, select a black-and-white mask image of the same shape as the input image. All white pixels will be in-painted according to the prompt, which black pixels will be preserved.
Image-to-image lets you start with an input image and transform it “towards” a prompt. For example, you can transform a children’s drawing of castle to a photorealistic castle.
- Enter a
promptthat describes what you want the output image to look like
- Select an input image in the
prompt_strengthfield changes how strongly the prompt is applied to the input image
With SDXL you can use a separate refiner model to add finer detail to your output.
You can use the refiner in two ways:
- As an ensemble of experts
- One after the other (
Ensemble of experts
- In this mode the SDXL base model handles the steps at the beginning (high noise), before handing over to the refining model for the final steps (low noise)
- You get a more detailed image from fewer steps
- You can change the point at which that handover happens, we default to 0.8 (80%)
One after the other
- In this mode you take your final output from SDXL base model and pass it to the refiner
- You can define how many steps the refiner takes
You can fine-tune SDXL using the Replicate fine-tuning API. Many languages are supported, but in this example we’ll use the Python SDK:
training = replicate.trainings.create(
Before fine-tuning starts, the input images are preprocessed using SwinIR for upscaling, BLIP for captioning, and CLIPSeg for removing regions of the images that are not interesting or helpful for training.
Below is a list of all fine-tuning parameters.
input_images(required): A .zip or .tar file containing the image files that will be used for fine-tuning.
seed: Random seed integer for reproducible training. Leave empty to use a random seed.
resolution: Square pixel resolution which your images will be resized to for training. Defaults to
train_batch_size: Batch size (per device) for training. Defaults to
num_train_epochs: Number of epochs to loop through your training dataset. Defaults to
max_train_steps: Number of individual training steps. Takes precedence over num_train_epochs. Defaults to
is_lora: Boolean indicating whether to use LoRA training. If set to False, will use Full fine tuning. Defaults to
unet_learning_rate: Learning rate for the U-Net as a float. We recommend this value to be somewhere between
1e-5. Defaults to
ti_lr: Scaling of learning rate for training textual inversion embeddings. Don’t alter unless you know what you’re doing. Defaults to
lora_lr: Scaling of learning rate for training LoRA embeddings. Don’t alter unless you know what you’re doing. Defaults to
lr_scheduler: Learning rate scheduler to use for training. Allowable values are
linear. Defaults to
lr_warmup_steps: Number of warmup steps for lr schedulers with warmups. Defaults to
token_string: A unique string that will be trained to refer to the concept in the input images. Can be anything, but TOK works well. Defaults to
caption_prefix: Text which will be used as prefix during automatic captioning. Must contain the
token_string. For example, if caption text is ‘a photo of TOK’, automatic captioning will expand to ‘a photo of TOK under a bridge’, ‘a photo of TOK holding a cup’, etc.”, Defaults to
a photo of TOK.
mask_target_prompts: Prompt that describes part of the image that you will find important. For example, if you are fine-tuning your pet,
photo of a dogwill be a good prompt. Prompt-based masking is used to focus the fine-tuning process on the important/salient parts of the image. Defaults to None.
crop_based_on_salience: If you want to crop the image to
target_size: based on the important parts of the image, set this to True. If you want to crop the image based on face detection, set this to False. Defaults to
use_face_detection_instead: If you want to use face detection instead of CLIPSeg for masking. For face applications, we recommend using this option. Defaults to
clipseg_temperature: How blurry you want the CLIPSeg mask to be. We recommend this value be something between
1.0. If you want to have more sharp mask (but thus more errorful), you can decrease this value. Defaults to
verbose: Verbose output. Defaults to
checkpointing_steps: Number of steps between saving checkpoints. Set to very very high number to disable checkpointing, because you don’t need one. Defaults to
- Developed by: Stability AI
- Model type: Diffusion-based text-to-image generative model
- License: CreativeML Open RAIL++-M License
- Model Description: This is a model that can be used to generate and modify images based on text prompts. It is a Latent Diffusion Model that uses two fixed, pretrained text encoders (OpenCLIP-ViT/G and CLIP-ViT/L).
- Resources for more information: Check out our GitHub Repository and the SDXL report on arXiv.
The chart above evaluates user preference for SDXL (with and without refinement) over SDXL 0.9 and Stable Diffusion 1.5 and 2.1. The SDXL base model performs significantly better than the previous variants, and the model combined with the refinement module achieves the best overall performance.
The model is intended for research purposes only. Possible research areas and tasks include
- Generation of artworks and use in design and other artistic processes.
- Applications in educational or creative tools.
- Research on generative models.
- Safe deployment of models which have the potential to generate harmful content.
- Probing and understanding the limitations and biases of generative models.
- Excluded uses are described below.
The model was not trained to be factual or true representations of people or events, and therefore using the model to generate such content is out-of-scope for the abilities of this model.
Limitations and Bias
- The model does not achieve perfect photorealism
- The model cannot render legible text
- The model struggles with more difficult tasks which involve compositionality, such as rendering an image corresponding to “A red cube on top of a blue sphere”
- Faces and people in general may not be generated properly.
- The autoencoding part of the model is lossy.
- While the capabilities of image generation models are impressive, they can also reinforce or exacerbate social biases.
SDXL consists of a mixture-of-experts pipeline for latent diffusion: In a first step, the base model is used to generate (noisy) latents, which are then further processed with a refinement model (available here: https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0/) specialized for the final denoising steps. Note that the base model can be used as a standalone module.
Alternatively, we can use a two-stage pipeline as follows: First, the base model is used to generate latents of the desired output size. In the second step, we use a specialized high-resolution model and apply a technique called SDEdit (https://arxiv.org/abs/2108.01073, also known as “img2img”) to the latents generated in the first step, using the same prompt. This technique is slightly slower than the first one, as it requires more function evaluations.