afiaka87 / glid-3-xl

CompVis `latent-diffusion text2im` finetuned for inpainting.

  • Public
  • 8K runs
  • T4
  • GitHub
  • License
This model uses a mask for inpainting.

Input

prompt (string)

Your text prompt.

Default: ""

negative (string)

(optional) Negate the model's prediction for this text from its prediction for the target text.

Default: ""

init_image (file)

(optional) Initial image to use for the model's prediction. If provided alongside a mask, the image will be inpainted instead.

mask (file)

A mask image for inpainting an init_image. White pixels are kept, black pixels are discarded. The mask is resized to width = image width / 8, height = image height / 8.
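
As a minimal sketch (not part of the model's inputs), a mask like this can be built with Pillow; the 256x256 size and the masked rectangle are arbitrary choices for illustration:

# Illustrative only: build an inpainting mask with Pillow.
# White (255) pixels are kept from init_image; black (0) pixels are regenerated.
from PIL import Image, ImageDraw

mask = Image.new("L", (256, 256), 255)      # start by keeping the whole image
draw = ImageDraw.Draw(mask)
draw.rectangle((64, 64, 192, 192), fill=0)  # regenerate the centre square
mask.save("mask.png")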

guidance_scale (number, minimum: -20, maximum: 100)

Classifier-free guidance scale. Higher values give stronger guidance toward the caption, with diminishing returns. Try values between 1.0 and 40.0; going above 5.0 generally introduces some artifacting.

Default: 5

steps (integer, minimum: 15, maximum: 250)

Number of diffusion steps to run. Because of PLMS sampling, using more than 100 steps is unnecessary and may simply produce the exact same output.

Default: 50

batch_size (integer, minimum: 1, maximum: 16)

Batch size (higher = slower).

Default: 4

width (integer)

Target width.

Default: 256

height (integer)

Target height.

Default: 256

init_skip_fraction (number, minimum: 0, maximum: 1)

Fraction of sampling steps to skip when using an init image. Defaults to 0.0 if init_image is not specified and 0.5 if it is.

Default: 0

aesthetic_rating (integer)

Aesthetic rating (1-9) embed to use.

Default: 9

aesthetic_weight (number)

Aesthetic weight (0-1). How much to guide towards the aesthetic embed vs. the prompt embed.

Default: 0.5

seed (integer, minimum: -1, maximum: 4294967295)

Seed for the random number generator. If -1, a random seed will be chosen.

Default: -1

intermediate_outputs (boolean)

Whether to return intermediate outputs. Enable to visualize the diffusion process and/or debug the model. May slow down inference.

Default: false
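
For reference, a minimal sketch of calling the hosted model with these inputs through the Replicate Python client (assumes the replicate package is installed and REPLICATE_API_TOKEN is set in your environment; the input values are placeholders):

# Sketch only: run the hosted model via the Replicate Python client.
import replicate

output = replicate.run(
    "afiaka87/glid-3-xl",  # optionally pin a specific version with "owner/model:version"
    input={
        "prompt": "an armchair in the form of an avocado",
        "guidance_scale": 5.0,
        "steps": 100,
        "batch_size": 4,
        "width": 256,
        "height": 256,
        "seed": -1,
    },
)
print(output)  # one generated image per batch item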

Output


Run time and cost

This model costs approximately $0.0044 to run on Replicate, or 227 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.

This model runs on Nvidia T4 GPU hardware. Predictions typically complete within 20 seconds. The predict time for this model varies significantly based on the inputs.

Readme

Latent Diffusion

CompVis’ 1.4B parameter Latent Diffusion Text-To-Image model finetuned for inpainting, logo generation, art generation and more.

This repo is modified from glid-3-xl. Aesthetic CLIP embeds are provided by aesthetic-predictor.

Quick start (docker required)

The following command will download all weights and run a prediction with your inputs inside a proper docker container.

cog predict r8.im/laion-ai/erlich \
  -i prompt="an armchair in the form of an avocado" \
  -i negative="" \
  -i init_image=@path/to/image \
  -i mask=@path/to/mask \
  -i guidance_scale=5.0 \
  -i steps=100 \
  -i batch_size=4 \
  -i width=256 \
  -i height=256 \
  -i init_skip_fraction=0.0 \
  -i aesthetic_rating=9 \
  -i aesthetic_weight=0.5 \
  -i seed=-1 \
  -i intermediate_outputs=False

Valid remote image URLs are:

Setup

Prerequisites

Please ensure the following dependencies are installed prior to building this repo:

  • build-essential
  • libopenmpi-dev
  • liblzma-dev
  • zlib1g-dev

Pytorch

It’s a good idea to use a virtual environment or a conda environment.

python3 -m venv venv
source venv/bin/activate
(venv) $

Before installing this repo, you should install PyTorch manually by following the instructions at pytorch.org.

In my case, I needed the following for CUDA 11.3.

(venv) $ pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/cu113

To check your CUDA version, run nvidia-smi.
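
A quick way to confirm that the CUDA build of PyTorch installed correctly (illustrative check, not from the original README):

# Verify that the installed PyTorch build can see the GPU.
import torch

print(torch.__version__)          # e.g. a +cu113 build
print(torch.version.cuda)         # CUDA version PyTorch was built against
print(torch.cuda.is_available())  # should print True on a working GPU setup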

Install ldm-finetune

You can now install this repo by running pip install -e . in the project directory.

(venv) $ git clone https://github.com/laion-ai/ldm-finetune.git
(venv) $ cd ldm-finetune
(venv) $ pip install -e .
(venv) $ pip install -r requirements.txt

Checkpoints

Foundation/Backbone models:

### CLIP-ONNX
wget -O textual.onnx 'https://clip-as-service.s3.us-east-2.amazonaws.com/models/onnx/ViT-L-14/textual.onnx'
wget -O visual.onnx 'https://clip-as-service.s3.us-east-2.amazonaws.com/models/onnx/ViT-L-14/visual.onnx'

### BERT Text Encoder
wget --continue https://dall-3.com/models/glid-3-xl/bert.pt

### kl-f8 VAE backbone
wget --continue https://dall-3.com/models/glid-3-xl/kl-f8.pt

Latent Diffusion Stage 2 (diffusion)

There are several stage 2 checkpoints to choose from:

inpaint.pt: The second finetune from jack000's glid-3-xl adds support for inpainting and can also be used for unconditional output by setting the inpaint image_embed to zeros. It is additionally finetuned to use the CLIP text embed via cross-attention (similar to unCLIP).

wget --continue https://dall-3.com/models/glid-3-xl/inpaint.pt

LAION Finetuning Checkpoints

LAION also finetuned inpaint.pt with the aim of improving logo and painting generation.

Erlich

erlich is inpaint.pt finetuned on a dataset collected from LAION-5B named Large Logo Dataset. It consists of roughly 100K images of logos with captions generated via BLIP using aggressive re-ranking and filtering.

wget --continue -O erlich.pt https://huggingface.co/laion/erlich/resolve/main/model/ema_0.9999_120000.pt

“You know aviato?”

Ongo

Ongo is inpaint.pt finetuned on the Wikiart dataset consisting of about 100K paintings with captions generated via BLIP using aggressive re-ranking and filtering. We also make use of the original captions which contain the author name and the painting title.

wget https://huggingface.co/laion/ongo/resolve/main/ongo.pt

“Ongo Gablogian, the art collector. Charmed, I’m sure.”

LAION - puck.pt

puck has been trained on pixel art. While the underlying kl-f8 encoder seems to struggle somewhat with pixel art, results are still interesting.

wget https://huggingface.co/laion/puck/resolve/main/puck.pt

Other

### CompVis - `diffusion.pt`
# The original checkpoint from CompVis trained on `LAION-400M`. May output watermarks.
wget --continue https://dall-3.com/models/glid-3-xl/diffusion.pt

### jack000 - `finetune.pt`
# The first finetune from jack000's [glid-3-xl](https://github.com/jack000/glid-3-xl). Modified to accept a CLIP text embed and finetuned on curated data to help with watermarks. Doesn't support inpainting.
# wget https://dall-3.com/models/glid-3-xl/finetune.pt 

Generating images

You can run predictions via Python or Docker. Currently, the Docker method is best supported.

Docker/cog

If you have access to a Linux machine (or WSL2 on Windows 11) with Docker installed, you can easily run models by installing cog:

sudo curl -o /usr/local/bin/cog -L https://github.com/replicate/cog/releases/latest/download/cog_`uname -s`_`uname -m`
sudo chmod +x /usr/local/bin/cog

Modify the MODEL_PATH in cog_sample.py:

MODEL_PATH = "erlich.pt"  # Can be erlich, ongo, puck, etc.

Now you can run predictions via docker container using:

cog predict -i prompt="a logo of a fox made of fire"

Output will be returned as a base64 string at the end of generation and is also saved locally as current_{batch_idx}.png.

Flask API

If you’d like to stand up your own ldm-finetune Flask API, you can run:

cog build -t my_ldm_image
docker run -d -p 5000:5000 --gpus all my_ldm_image

Predictions can then be accessed via HTTP:

curl http://localhost:5000/predictions -X POST \
    -H 'Content-Type: application/json' \
    -d '{"input": {"prompt": "a logo of a fox made of fire"}}'

The output from the API will be a list of base64 strings representing your generations.
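
As a rough sketch of consuming that response from Python (assumptions: the container is reachable on localhost:5000 and, as with cog's HTTP API, the generations are returned under an "output" key, possibly as data URIs):

# Sketch: post a prediction request and decode the base64 generations to PNG files.
import base64
import requests

resp = requests.post(
    "http://localhost:5000/predictions",
    json={"input": {"prompt": "a logo of a fox made of fire"}},
)
resp.raise_for_status()

for i, item in enumerate(resp.json()["output"]):
    b64 = item.split(",", 1)[-1]  # drop a "data:image/png;base64," prefix if present
    with open(f"generation_{i}.png", "wb") as out_file:
        out_file.write(base64.b64decode(b64))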