afiaka87 / glid-3-xl

CompVis `latent-diffusion text2im` finetuned for inpainting.

  • Public
  • 7.9K runs
  • GitHub
  • License

Readme

Latent Diffusion

CompVis’ 1.4B parameter Latent Diffusion Text-To-Image model finetuned for inpainting, logo generation, art generation and more.

This repo is modified from glid-3-xl. Aesthetic CLIP embeds are provided by aesthetic-predictor

Quick start (docker required)

The following command will download all weights and run a prediction with your inputs inside a proper docker container.

cog predict r8.im/laion-ai/erlich \
  -i prompt="an armchair in the form of an avocado" \
  -i negative="" \
  -i init_image=@path/to/image \
  -i mask=@path/to/mask \
  -i guidance_scale=5.0 \
  -i steps=100 \
  -i batch_size=4 \
  -i width=256 \
  -i height=256 \
  -i init_skip_fraction=0.0 \
  -i aesthetic_rating=9 \
  -i aesthetic_weight=0.5 \
  -i seed=-1 \
  -i intermediate_outputs=False

Valid remote image URL’s are:

Setup

Prerequisites

Please ensure the following dependencies are installed prior to building this repo:

  • build-essential
  • libopenmpi-dev
  • liblzma-dev
  • zlib1g-dev

Pytorch

It’s a good idea to use a virtual environment or a conda environment.

python3 -m venv .venv
source venv/bin/activate
(venv) $

Before installing, you should install pytorch manually by following the instructions at pytorch.org

In my instance, I needed the following for cuda 11.3.

(venv) $ pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/cu113

To check your cuda version, run nvidia-smi.

Install ldm-finetune

You can now install this repo by running pip install -e . in the project directory.

(venv) $ git clone https://github.com/laion-ai/ldm-finetune.git
(venv) $ cd ldm-finetune
(venv) $ pip install -e .
(venv) $ pip install -r requirements.txt

Checkpoints

Foundation/Backbone models:

# CLIP-ONNX
wget -O textual.onnx 'https://clip-as-service.s3.us-east-2.amazonaws.com/models/onnx/ViT-L-14/textual.onnx'
wget -O visual.onnx 'https://clip-as-service.s3.us-east-2.amazonaws.com/models/onnx/ViT-L-14/visual.onnx'

### BERT Text Encoder
wget --continue https://dall-3.com/models/glid-3-xl/bert.pt

### kl-f8 VAE backbone
wget --continue https://dall-3.com/models/glid-3-xl/kl-f8.pt

Latent Diffusion Stage 2 (diffusion)

There are several stage 2 checkpoints to choose from:

The second finetune from jack000’s glid-3-xl adds support for inpainting and can be used for unconditional output as well by setting the inpaint image_embed to zeros. Additionally finetuned to use the CLIP text embed via cross-attention (similar to unCLIP).

wget –continue https://dall-3.com/models/glid-3-xl/inpaint.pt

LAION Finetuning Checkpoints

Laion also finetuned inpaint.pt with the aim of improving logo generation and painting generation.

Erlich

erlich is inpaint.pt finetuned on a dataset collected from LAION-5B named Large Logo Dataset. It consists of roughly 100K images of logos with captions generated via BLIP using aggressive re-ranking and filtering.

wget --continue -O erlich.pt https://huggingface.co/laion/erlich/resolve/main/model/ema_0.9999_120000.pt

“You know aviato?”

Ongo

Ongo is inpaint.pt finetuned on the Wikiart dataset consisting of about 100K paintings with captions generated via BLIP using aggressive re-ranking and filtering. We also make use of the original captions which contain the author name and the painting title.

wget https://huggingface.co/laion/ongo/resolve/main/ongo.pt

“Ongo Gablogian, the art collector. Charmed, I’m sure.”

LAION - puck.pt

puck has been trained on pixel art. While the underlying kl-f8 encoder seems to struggle somewhat with pixel art, results are still interesting.

wget https://huggingface.co/laion/puck/resolve/main/puck.pt

Other

### CompVis - `diffusion.pt`
# The original checkpoint from CompVis trained on `LAION-400M`. May output watermarks.
wget --continue https://dall-3.com/models/glid-3-xl/diffusion.pt

### jack000 - `finetune.pt`
# The first finetune from jack000's [glid-3-xl](https://github.com/jack000/glid-3-xl). Modified to accept a CLIP text embed and finetuned on curated data to help with watermarks. Doesn't support inpainting.
# wget https://dall-3.com/models/glid-3-xl/finetune.pt 

Generating images

You can run prediction via python or docker. Currently the docker method is best supported.

Docker/cog

If you have access to a linux machine (or WSL2.0 on Windows 11) with docker installed, you can very easily run models by installing cog:

sudo curl -o /usr/local/bin/cog -L https://github.com/replicate/cog/releases/latest/download/cog_`uname -s`_`uname -m`
sudo chmod +x /usr/local/bin/cog

Modify the MODEL_PATH in cog_sample.py:

MODEL_PATH = "erlich.pt"  # Can be erlich, ongo, puck, etc.

Now you can run predictions via docker container using:

cog predict -i prompt="a logo of a fox made of fire"

Output will be returned as a base64 string at the end of generation and is also saved locally at current_{batch_idx}.png

Flask API

If you’d like to stand up your own ldm-finetune Flask API, you can run:

cog build -t my_ldm_image
docker run -d -p 5000:5000 --gpus all my_ldm_image

Predictions can then be accessed via HTTP:

curl http://localhost:5000/predictions -X POST \
    -H 'Content-Type: application/json' \
    -d '{"input": {"prompt": "a logo of a fox made of fire"}}'

The output from the API will be a list of base64 strings representing your generations.