A to Z of Stable Diffusion

An open-source, power-user web interface for Stable Diffusion, sometimes referred to as ‘A1111’.

A parameter, common in generative models, that controls the influence of a prompt (or another guiding signal) on a generated output. Higher values give outputs that follow the prompt more closely, but at the cost of output diversity and creativity.
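
As a rough sketch of how this is typically set, here is how the value might be passed with the Hugging Face diffusers library (diffusers calls it guidance_scale; the checkpoint id is just an example):

```python
import torch
from diffusers import StableDiffusionPipeline

# Example checkpoint id; any Stable Diffusion checkpoint works the same way.
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

prompt = "a photo of an astronaut riding a horse"

# Low guidance: looser, more varied interpretations of the prompt.
loose = pipe(prompt, guidance_scale=3.0).images[0]

# High guidance: follows the prompt more literally, with less diversity.
strict = pipe(prompt, guidance_scale=12.0).images[0]
```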

A model that guides the generation of a new image based on aspects or features of an input image. The type of guidance depends on the ControlNet used. A preprocessor is often needed to convert an input image into a format that can guide the generation process. Used alongside Stable Diffusion.

Examples include:

  • edge detection (canny)
  • depth map
  • segmentation
  • human pose

Try out ControlNet with SDXL
Watch a video guide to ControlNet models
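
As a sketch of the workflow, assuming the Hugging Face diffusers library and an SD 1.5 canny ControlNet (the model ids and input file are examples), preprocessing and generation might look like this:

```python
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Preprocess: turn the input image into a canny edge map.
image = np.array(Image.open("input.png").convert("RGB"))
edges = cv2.Canny(image, 100, 200)
edges = np.stack([edges] * 3, axis=-1)  # single channel -> RGB
control_image = Image.fromarray(edges)

# The edge map guides generation so the output keeps the input's structure.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # example base checkpoint; substitute any SD 1.5 model
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

result = pipe("a futuristic city at night", image=control_image).images[0]
```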

A neural network component that reconstructs data from encoded representations.

The step-by-step process of gradually transforming noise into a coherent output.

A parameter controlling image alteration in img2img.

Denoising strength controls how much noise is added to the initial image. More noise means more of the original image will change, which gives the diffusion process more opportunity to match a given prompt (i.e. it acts like a prompt strength).
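
For illustration, the diffusers img2img pipeline exposes this as the strength argument (the checkpoint id and input file are examples):

```python
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("photo.png").convert("RGB").resize((512, 512))

# strength is diffusers' name for denoising strength: near 0.0 keeps the
# input image mostly unchanged, near 1.0 replaces it almost entirely.
subtle = pipe("a watercolor painting", image=init_image, strength=0.3).images[0]
drastic = pipe("a watercolor painting", image=init_image, strength=0.9).images[0]
```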

A depth map is generated from an input image, usually as a preprocessor for a ControlNet model. This depth map is then used to guide the generation of a new image, leading to a new image with a similar structure.

There are different models for generating depth maps, including:

  • MiDaS
  • LeReS
  • ZoeDepth

Try out depth maps and other ControlNet preprocessors
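
As an illustrative sketch, a depth map can be produced with a MiDaS-family model through the Hugging Face transformers depth-estimation pipeline (the model id and input file are examples):

```python
from PIL import Image
from transformers import pipeline

# Estimate depth from an input image; the resulting depth map can then be
# fed to a depth ControlNet to guide generation.
depth_estimator = pipeline("depth-estimation", model="Intel/dpt-large")

image = Image.open("input.png").convert("RGB")
depth_map = depth_estimator(image)["depth"]  # a greyscale PIL image
depth_map.save("depth.png")
```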

A diffusion model is a type of generative AI model that transforms random noise into structured data, such as images, audio, or text. It gradually shapes this noise through a series of steps to produce coherent and detailed outputs.
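
A toy illustration of the idea; the denoiser here is a stand-in for a real trained network, not an actual implementation:

```python
import numpy as np

def toy_denoiser(sample, step):
    # Stand-in for a trained noise-prediction network (in Stable Diffusion,
    # the U-Net plays this role). Here it just nudges the sample toward zero.
    return sample * 0.1

# Start from pure Gaussian noise and gradually remove predicted noise.
sample = np.random.randn(64, 64)
for step in reversed(range(50)):            # 50 denoising steps
    predicted_noise = toy_denoiser(sample, step)
    sample = sample - predicted_noise       # one simplified reverse-diffusion step
```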

Embeddings are representations of items such as words, sentences, or image features, expressed as vectors in a continuous vector space. These representations capture the characteristics or features of the original data, allowing for efficient processing and analysis by AI models.
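
For example, Stable Diffusion 1.x uses a CLIP text encoder to turn a prompt into per-token embedding vectors. A minimal sketch with the transformers library:

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

# The text encoder used by Stable Diffusion 1.x.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

tokens = tokenizer("a photo of a cat", return_tensors="pt", padding=True)
with torch.no_grad():
    embeddings = text_encoder(**tokens).last_hidden_state

print(embeddings.shape)  # (batch, tokens, hidden size): one vector per token
```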

A neural network component that compresses data into a compact representation.

One complete pass of the training dataset through the algorithm. During an epoch, a model has the opportunity to learn from each example in the dataset.

Adjusting a pre-trained model for specific tasks or improvements.

Hyperparameters are settings that define how a model is structured and how it is trained, such as the learning rate or batch size. They can be tuned for better model performance. A guide to hyperparameter tuning by Jeremy Jordan.
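
An illustrative, hypothetical set of hyperparameters for a fine-tuning run might look like this; the names are common conventions rather than the settings of any particular trainer:

```python
# Hypothetical hyperparameters for fine-tuning an image model.
hyperparameters = {
    "learning_rate": 1e-4,  # size of each weight update
    "batch_size": 4,        # examples processed per training step
    "train_steps": 1000,    # total number of optimisation steps
    "resolution": 512,      # training image size in pixels
}
```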

Transforming one image into another, often guided by a text prompt. How much an image changes is controlled by the denoising strength parameter.

Running a trained model to get an output. In machine learning, and on Replicate, these outputs are called predictions.
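
A minimal sketch of running inference on Replicate with the Python client. It assumes a REPLICATE_API_TOKEN environment variable, and the version id is a placeholder to copy from the model page:

```python
import replicate

# "<version-id>" is a placeholder; use the version shown on the model's
# Replicate page.
output = replicate.run(
    "stability-ai/sdxl:<version-id>",
    input={"prompt": "an astronaut riding a rainbow unicorn"},
)
print(output)  # the prediction, typically a list of image URLs
```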

Changing specific areas of an image. The areas are specified by a mask.

An example of inpainting with SDXL
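
A sketch using a diffusers inpainting pipeline (the model id and input files are examples); the mask marks which areas to regenerate:

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
).to("cuda")

image = Image.open("photo.png").convert("RGB").resize((512, 512))
# White pixels in the mask are regenerated; black pixels are kept.
mask = Image.open("mask.png").convert("RGB").resize((512, 512))

result = pipe("a bouquet of flowers", image=image, mask_image=mask).images[0]
```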

A high-dimensional space where AI models represent data.

Assessing the performance of a machine learning model.

A text input specifying what should not appear in a generated output. A text prompt asking for a photo of a cat might be paired with a negative prompt of ‘art, illustration, render’ to avoid getting images of cartoon cats.

Try using negative prompts with Stable Diffusion XL
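
For example, with the diffusers SDXL pipeline the negative prompt is passed alongside the prompt:

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    prompt="a photo of a cat",
    negative_prompt="art, illustration, render",  # steer away from cartoon styles
).images[0]
```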

A system designed to mimic the way human brains analyze and process information. It consists of interconnected nodes that work together to recognize patterns and make decisions based on input data.

Nodes are aggregated into layers. Signals travel from the input layer, through one or more hidden layers, to the output layer.

Learn more about neural networks
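
A minimal sketch of layers of interconnected nodes, using PyTorch:

```python
import torch
from torch import nn

# A tiny feed-forward network: signals flow from the input layer, through a
# hidden layer of nodes, to the output layer.
model = nn.Sequential(
    nn.Linear(4, 16),  # input layer -> hidden layer
    nn.ReLU(),         # non-linearity applied at the hidden nodes
    nn.Linear(16, 2),  # hidden layer -> output layer
)

output = model(torch.randn(1, 4))
print(output.shape)  # torch.Size([1, 2])
```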

Overtraining occurs when a model learns its training data too thoroughly. An overfit model will perform poorly on new, unseen data, because it fails to generalize from the specific examples it was trained on.

If you are fine-tuning, try to use a more diverse training dataset or train with fewer steps.

Predictions in machine learning refer to the output generated by a model when it is given new, unseen data. Based on the patterns and relationships it has learned during training, the model estimates or forecasts likely outcomes for this new data.

View your predictions on Replicate

Text input to a generative AI model describing the desired output.

Crafting text inputs that guide AI models toward better outputs, often drawing on an understanding of the model’s characteristics and limitations.

An algorithm that determines the denoising process for a diffusion model. Schedulers play a critical role in determining how the noise is incrementally reduced (denoising) to form the final output.

They are called schedulers because they determine the noise schedule used during the diffusion process. They are sometimes called samplers because the denoising process creates a sample at each step.

Example schedulers include:

  • Euler
  • DDIM
  • DPM++ 2M Karras

Learn more about schedulers on HuggingFace
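
As a sketch, diffusers lets you swap the scheduler on an existing pipeline (the checkpoint id is an example):

```python
import torch
from diffusers import DPMSolverMultistepScheduler, StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

# Swap the default scheduler for DPM++ 2M: it defines the noise schedule
# followed at each denoising step.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

image = pipe("a lighthouse at sunset", num_inference_steps=25).images[0]
```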

A collection of open-source AI models for text-to-image generation.

Applying the style of one image to another.

Generating images from text prompts using AI.

A neural network that predicts the noise at each sampling step in Stable Diffusion.

Increasing image resolution while enhancing details using an AI model.
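
One illustrative approach, using the Stable Diffusion x4 upscaler through diffusers (the input file is an example):

```python
import torch
from PIL import Image
from diffusers import StableDiffusionUpscalePipeline

pipe = StableDiffusionUpscalePipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16
).to("cuda")

low_res = Image.open("small.png").convert("RGB")  # e.g. a 128x128 image
# The prompt describes the content to help the model add plausible detail.
upscaled = pipe(prompt="a sharp, detailed photo", image=low_res).images[0]  # 4x larger
```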

A VAE (variational autoencoder) can:

  • encode images into latent space
  • decode latents back into an image

Rather than working with pixels, which would be very slow, many diffusion models work in a latent space that is much smaller. This allows them to be more efficient.

During training, training data is encoded into latent space. During inference, the output of the diffusion process is decoded back into an image.
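
A minimal sketch of encoding and decoding with a Stable Diffusion VAE through diffusers (the VAE id, input file, and image size are examples):

```python
import torch
from PIL import Image
from torchvision import transforms
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")

# Prepare a 512x512 image as a tensor scaled to [-1, 1].
to_tensor = transforms.Compose([transforms.Resize((512, 512)), transforms.ToTensor()])
pixels = to_tensor(Image.open("photo.png").convert("RGB")).unsqueeze(0) * 2 - 1

with torch.no_grad():
    # Encode into the much smaller latent space...
    latents = vae.encode(pixels).latent_dist.sample()  # shape [1, 4, 64, 64]
    # ...and decode the latents back into pixels.
    decoded = vae.decode(latents).sample               # shape [1, 3, 512, 512]
```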