CompVis `latent-diffusion text2im` finetuned for inpainting.


Run time and cost

This model runs on Nvidia T4 GPU hardware. Predictions typically complete within 20 seconds. The predict time for this model varies significantly based on the inputs.


Latent Diffusion

CompVis’ 1.4B parameter Latent Diffusion Text-To-Image model finetuned for inpainting, logo generation, art generation and more.

This repo is modified from glid-3-xl. Aesthetic CLIP embeds are provided by aesthetic-predictor.

Quick start (docker required)

The following command will download all weights and run a prediction with your inputs inside a proper docker container.

cog predict \
  -i prompt="an armchair in the form of an avocado" \
  -i negative="" \
  -i init_image=@path/to/image \
  -i mask=@path/to/mask \
  -i guidance_scale=5.0 \
  -i steps=100 \
  -i batch_size=4 \
  -i width=256 \
  -i height=256 \
  -i init_skip_fraction=0.0 \
  -i aesthetic_rating=9 \
  -i aesthetic_weight=0.5 \
  -i seed=-1 \
  -i intermediate_outputs=False
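
If you want to script several predictions, the same command line can be assembled from Python. The following is a minimal sketch, not part of this repo: `build_cog_predict_args` is a hypothetical helper, and the input names are simply the ones shown in the command above.

```python
import shlex


def build_cog_predict_args(inputs):
    """Turn a dict of model inputs into an argument list for `cog predict`."""
    args = ["cog", "predict"]
    for name, value in inputs.items():
        # Each input becomes one `-i name=value` pair.
        # File inputs keep their leading "@", e.g. "@path/to/image".
        args += ["-i", f"{name}={value}"]
    return args


inputs = {
    "prompt": "an armchair in the form of an avocado",
    "guidance_scale": 5.0,
    "steps": 100,
    "batch_size": 4,
    "width": 256,
    "height": 256,
    "seed": -1,
}
args = build_cog_predict_args(inputs)

# Print a shell-quoted version of the command for inspection.
print(shlex.join(args))

# To actually run it (requires cog and docker to be installed):
# import subprocess
# subprocess.run(args, check=True)
```

Passing the arguments as a list avoids shell quoting problems with prompts that contain spaces or quotes.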

Valid remote image URLs are:



Please ensure the following dependencies are installed prior to building this repo:

  • build-essential
  • libopenmpi-dev
  • liblzma-dev
  • zlib1g-dev


It’s a good idea to use a virtual environment or a conda environment.

python3 -m venv venv
source venv/bin/activate
(venv) $

Before installing, you should install pytorch manually by following the instructions at

In my case, I needed the following for CUDA 11.3.

(venv) $ pip install torch torchvision --extra-index-url

To check your CUDA version, run nvidia-smi.
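
Once pytorch is installed, you can also check from Python which CUDA version your pytorch build targets and whether it can see a GPU. This is a small sketch; `describe_cuda` is a hypothetical helper name. Note that nvidia-smi reports the highest CUDA version the driver supports, which can differ from the version pytorch was built against.

```python
def describe_cuda():
    """Report the CUDA version PyTorch was built with, if PyTorch is installed."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment yet.
        return {"torch_installed": False, "cuda_build": None, "gpu_available": False}
    return {
        "torch_installed": True,
        "cuda_build": torch.version.cuda,  # None for CPU-only builds
        "gpu_available": torch.cuda.is_available(),
    }


print(describe_cuda())
```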

Install ldm-finetune

You can now install this repo by running pip install -e . in the project directory.

(venv) $ git clone
(venv) $ cd ldm-finetune
(venv) $ pip install -e .
(venv) $ pip install -r requirements.txt


Foundation/Backbone models:

wget -O textual.onnx ''
wget -O visual.onnx ''

### BERT Text Encoder
wget --continue

### kl-f8 VAE backbone
wget --continue

Latent Diffusion Stage 2 (diffusion)

There are several stage 2 checkpoints to choose from:

The second finetune from jack000’s glid-3-xl adds support for inpainting and can be used for unconditional output as well by setting the inpaint image_embed to zeros. Additionally finetuned to use the CLIP text embed via cross-attention (similar to unCLIP).

wget --continue

LAION Finetuning Checkpoints

LAION also released finetuned checkpoints aimed at improving logo generation and painting generation.


erlich is finetuned on a dataset collected from LAION-5B named Large Logo Dataset. It consists of roughly 100K images of logos with captions generated via BLIP using aggressive re-ranking and filtering.

wget --continue -O

“You know aviato?”


Ongo is finetuned on the Wikiart dataset consisting of about 100K paintings with captions generated via BLIP using aggressive re-ranking and filtering. We also make use of the original captions which contain the author name and the painting title.


“Ongo Gablogian, the art collector. Charmed, I’m sure.”


puck has been trained on pixel art. While the underlying kl-f8 encoder seems to struggle somewhat with pixel art, results are still interesting.



### CompVis - ``
# The original checkpoint from CompVis trained on `LAION-400M`. May output watermarks.
wget --continue

### jack000 - ``
# The first finetune from jack000's glid-3-xl. Modified to accept a CLIP text embed and finetuned on curated data to help with watermarks. Doesn't support inpainting.
# wget 

Generating images

You can run prediction via python or docker. Currently the docker method is best supported.


If you have access to a Linux machine (or WSL 2 on Windows 11) with docker installed, you can run models by installing cog:

sudo curl -o /usr/local/bin/cog -L`uname -s`_`uname -m`
sudo chmod +x /usr/local/bin/cog

Modify the MODEL_PATH in

MODEL_PATH = ""  # Can be erlich, ongo, puck, etc.

Now you can run predictions via docker container using:

cog predict -i prompt="a logo of a fox made of fire"

Output is returned as a base64 string at the end of generation and is also saved locally at current_{batch_idx}.png.

Flask API

If you’d like to stand up your own ldm-finetune Flask API, you can run:

cog build -t my_ldm_image
docker run -d -p 5000:5000 --gpus all my_ldm_image

Predictions can then be accessed via HTTP:

curl http://localhost:5000/predictions -X POST \
    -H 'Content-Type: application/json' \
    -d '{"input": {"prompt": "a logo of a fox made of fire"}}'
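
The same request can be sketched in Python using only the standard library. `build_prediction_request` is a hypothetical helper; the URL and JSON body mirror the curl command above, and the commented-out call assumes the container from the previous step is running.

```python
import json
import urllib.request


def build_prediction_request(prompt, url="http://localhost:5000/predictions"):
    """Build a POST request equivalent to the curl command above."""
    body = json.dumps({"input": {"prompt": prompt}}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_prediction_request("a logo of a fox made of fire")

# Sending the request requires the API container to be up:
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
```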

The output from the API will be a list of base64 strings representing your generations.
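
To turn those strings back into image files, something like the following should work. It is a sketch under assumptions: `save_generations` and the `generation_{index}.png` file names are made up for illustration, and the code allows for each string being either raw base64 or a `data:` URI with a base64 payload, since the exact format is not stated here.

```python
import base64
from pathlib import Path


def save_generations(outputs, out_dir="generations"):
    """Decode a list of base64 strings and write each one to a PNG file."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    saved = []
    for index, item in enumerate(outputs):
        # Assumption: an item may be a data URI such as
        # "data:image/png;base64,<payload>"; keep only the payload part.
        payload = item.split(",", 1)[1] if item.startswith("data:") else item
        target = out_path / f"generation_{index}.png"
        target.write_bytes(base64.b64decode(payload))
        saved.append(target)
    return saved


# Example, assuming `result` is the parsed JSON response and that the
# list of generations sits under an "output" key (an assumption):
# paths = save_generations(result["output"])
```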