afiaka87 / glid-3-xl

CompVis `latent-diffusion text2im` finetuned for inpainting.

Run time and cost

This model runs on Nvidia T4 GPU hardware. Predictions typically complete within 20 seconds. The predict time for this model varies significantly based on the inputs.

Readme

Latent Diffusion

CompVis’ 1.4B parameter Latent Diffusion Text-To-Image model finetuned for inpainting, logo generation, art generation and more.

This repo is modified from glid-3-xl. Aesthetic CLIP embeds are provided by aesthetic-predictor.

Quick start (docker required)

The following command downloads all weights and runs a prediction with your inputs inside a Docker container.

cog predict r8.im/laion-ai/erlich \
  -i prompt="an armchair in the form of an avocado" \
  -i negative="" \
  -i init_image=@path/to/image \
  -i mask=@path/to/mask \
  -i guidance_scale=5.0 \
  -i steps=100 \
  -i batch_size=4 \
  -i width=256 \
  -i height=256 \
  -i init_skip_fraction=0.0 \
  -i aesthetic_rating=9 \
  -i aesthetic_weight=0.5 \
  -i seed=-1 \
  -i intermediate_outputs=False
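
If you want to launch the same prediction from a script, the command line above can be assembled programmatically. The following is a minimal sketch, not part of this repo: `build_cog_predict_command` is a hypothetical helper, and only a subset of the inputs shown above is passed.

```python
import shlex


def build_cog_predict_command(model, inputs):
    """Return the argv list for `cog predict MODEL -i key=value ...`."""
    argv = ["cog", "predict", model]
    for key, value in inputs.items():
        # Each input becomes its own `-i key=value` pair, as in the command above.
        argv += ["-i", f"{key}={value}"]
    return argv


# A subset of the inputs from the quick start command.
inputs = {
    "prompt": "an armchair in the form of an avocado",
    "guidance_scale": 5.0,
    "steps": 100,
    "batch_size": 4,
    "width": 256,
    "height": 256,
    "seed": -1,
}

command = build_cog_predict_command("r8.im/laion-ai/erlich", inputs)

# Print a shell-quoted version of the command for inspection.
print(shlex.join(command))
```

On a machine with cog and Docker installed, the resulting list could be passed to `subprocess.run(command, check=True)`.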

Valid remote image URLs are:

Setup

Prerequisites

Please ensure the following dependencies are installed prior to building this repo:

  • build-essential
  • libopenmpi-dev
  • liblzma-dev
  • zlib1g-dev

Pytorch

It’s a good idea to use a virtual environment or a conda environment.

python3 -m venv venv
source venv/bin/activate
(venv) $

Before installing, you should install PyTorch manually by following the instructions at pytorch.org.

In my case, I needed the following for CUDA 11.3:

(venv) $ pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/cu113

To check your CUDA version, run nvidia-smi. The version shown in its header is the highest CUDA version your driver supports.
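
Once PyTorch is installed, you can also confirm from Python that the build you picked can see your GPU. This is a small sketch, not part of this repo; `describe_torch_cuda` is a hypothetical helper, and it falls back to a message when torch is missing.

```python
def describe_torch_cuda():
    """Summarize the installed torch build and whether it can see a CUDA device."""
    try:
        import torch
    except ImportError:
        return "torch is not installed in this environment"

    if not torch.cuda.is_available():
        return f"torch {torch.__version__} is installed, but no CUDA device is visible"

    # torch.version.cuda is the CUDA version the wheel was built against.
    return (
        f"torch {torch.__version__} built for CUDA {torch.version.cuda}, "
        f"device: {torch.cuda.get_device_name(0)}"
    )


print(describe_torch_cuda())
```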

Install ldm-finetune

You can now install this repo by running pip install -e . in the project directory.

(venv) $ git clone https://github.com/laion-ai/ldm-finetune.git
(venv) $ cd ldm-finetune
(venv) $ pip install -e .
(venv) $ pip install -r requirements.txt

Checkpoints

Foundation/Backbone models:

### CLIP-ONNX
wget -O textual.onnx 'https://clip-as-service.s3.us-east-2.amazonaws.com/models/onnx/ViT-L-14/textual.onnx'
wget -O visual.onnx 'https://clip-as-service.s3.us-east-2.amazonaws.com/models/onnx/ViT-L-14/visual.onnx'

### BERT Text Encoder
wget --continue https://dall-3.com/models/glid-3-xl/bert.pt

### kl-f8 VAE backbone
wget --continue https://dall-3.com/models/glid-3-xl/kl-f8.pt
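
After running the downloads above, a quick check can confirm that the four backbone files are present and non-empty. This is a sketch under the assumption that you ran the wget commands in the current directory, so the file names match the `-O` arguments and URL basenames above; `missing_files` is a hypothetical helper, not part of this repo.

```python
from pathlib import Path

# File names produced by the wget commands above.
BACKBONE_FILES = ["textual.onnx", "visual.onnx", "bert.pt", "kl-f8.pt"]


def missing_files(directory, names=BACKBONE_FILES):
    """Return the names that are absent or empty in `directory`."""
    root = Path(directory)
    return [
        name
        for name in names
        if not (root / name).is_file() or (root / name).stat().st_size == 0
    ]


# An empty list means every backbone file was found.
print(missing_files("."))
```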

Latent Diffusion Stage 2 (diffusion)

There are several stage 2 checkpoints to choose from:

The second finetune from jack000’s glid-3-xl adds support for inpainting. It can also be used for unconditional output by setting the inpaint image_embed to zeros. It was additionally finetuned to use the CLIP text embed via cross-attention (similar to unCLIP).

wget --continue https://dall-3.com/models/glid-3-xl/inpaint.pt

LAION Finetuning Checkpoints

LAION also finetuned inpaint.pt with the aim of improving logo generation and painting generation.

Erlich

erlich is inpaint.pt finetuned on a dataset collected from LAION-5B named Large Logo Dataset. It consists of roughly 100K images of logos with captions generated via BLIP using aggressive re-ranking and filtering.

wget --continue -O erlich.pt https://huggingface.co/laion/erlich/resolve/main/model/ema_0.9999_120000.pt

“You know aviato?”

Ongo

Ongo is inpaint.pt finetuned on the Wikiart dataset consisting of about 100K paintings with captions generated via BLIP using aggressive re-ranking and filtering. We also make use of the original captions which contain the author name and the painting title.

wget https://huggingface.co/laion/ongo/resolve/main/ongo.pt

“Ongo Gablogian, the art collector. Charmed, I’m sure.”

LAION - puck.pt

puck has been trained on pixel art. While the underlying kl-f8 encoder seems to struggle somewhat with pixel art, results are still interesting.

wget https://huggingface.co/laion/puck/resolve/main/puck.pt

Other

### CompVis - `diffusion.pt`
# The original checkpoint from CompVis trained on `LAION-400M`. May output watermarks.
wget --continue https://dall-3.com/models/glid-3-xl/diffusion.pt

### jack000 - `finetune.pt`
# The first finetune from jack000's [glid-3-xl](https://github.com/jack000/glid-3-xl). Modified to accept a CLIP text embed and finetuned on curated data to help with watermarks. Doesn't support inpainting.
# wget https://dall-3.com/models/glid-3-xl/finetune.pt 

Generating images

You can run prediction via python or docker. Currently the docker method is best supported.

Docker/cog

If you have access to a Linux machine (or WSL 2 on Windows 11) with Docker installed, you can run models by installing cog:

sudo curl -o /usr/local/bin/cog -L https://github.com/replicate/cog/releases/latest/download/cog_`uname -s`_`uname -m`
sudo chmod +x /usr/local/bin/cog

Modify the MODEL_PATH in cog_sample.py:

MODEL_PATH = "erlich.pt"  # Can be erlich, ongo, puck, etc.

Now you can run predictions via docker container using:

cog predict -i prompt="a logo of a fox made of fire"

Output is returned as a base64 string at the end of generation and is also saved locally as current_{batch_idx}.png.
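
If you capture the base64 string rather than using the saved PNG, it can be turned back into image bytes with the standard library. A minimal sketch follows; both function names are hypothetical, and the handling of a `data:image/png;base64,` prefix is an assumption about how the string may be wrapped, not something this repo documents.

```python
import base64


def decode_image_string(encoded):
    """Decode a base64 image string, tolerating an optional data-URI prefix."""
    if encoded.startswith("data:"):
        # Assumption: a data URI keeps the payload after the first comma.
        encoded = encoded.split(",", 1)[1]
    return base64.b64decode(encoded)


def save_image_string(encoded, path):
    """Write the decoded bytes to `path` and return the number of bytes written."""
    data = decode_image_string(encoded)
    with open(path, "wb") as handle:
        handle.write(data)
    return len(data)
```

For example, `save_image_string(output_string, "generation.png")` would write one generation to disk, where `output_string` is the string printed by cog.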

Flask API

If you’d like to stand up your own ldm-finetune Flask API, you can run:

cog build -t my_ldm_image
docker run -d -p 5000:5000 --gpus all my_ldm_image

Predictions can then be accessed via HTTP:

curl http://localhost:5000/predictions -X POST \
    -H 'Content-Type: application/json' \
    -d '{"input": {"prompt": "a logo of a fox made of fire"}}'

The output from the API will be a list of base64 strings representing your generations.
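
The same request can be sent from Python using only the standard library. This is a sketch rather than an official client: `build_request`, `extract_outputs`, and `predict` are hypothetical names, and reading the generations from an `"output"` key in the JSON response is an assumption about the response shape.

```python
import json
import urllib.request

API_URL = "http://localhost:5000/predictions"


def build_request(prompt, url=API_URL):
    """Build the same POST request as the curl command above."""
    body = json.dumps({"input": {"prompt": prompt}}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_outputs(response_body):
    """Return the list of base64 strings from a JSON response body."""
    payload = json.loads(response_body)
    # Assumption: generations are stored under the "output" key.
    outputs = payload.get("output") or []
    if isinstance(outputs, str):
        outputs = [outputs]
    return outputs


def predict(prompt, url=API_URL):
    """Send the request and return the base64 strings (needs the container running)."""
    with urllib.request.urlopen(build_request(prompt, url)) as response:
        return extract_outputs(response.read())
```

With the container from the previous step running, `predict("a logo of a fox made of fire")` would return the generations, and each entry can then be decoded with `base64.b64decode`.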