laion-ai / ongo

Generate a painting using text.

Run time and cost

This model costs approximately $0.056 to run on Replicate, or 17 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.

This model runs on Nvidia T4 GPU hardware. Predictions typically complete within 5 minutes. The predict time for this model varies significantly based on the inputs.
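
Because the model is open source and packaged with Cog, you can also run it locally with Docker. A minimal sketch, run from the ldm-finetune repo root (the input name prompt is an assumption; check the repo's predict.py for the actual input schema):

# build the container and run a single prediction (requires Docker and Cog)
cog predict -i prompt="an oil painting of a lighthouse at dusk"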

Readme

Ongo

Ongo Gablogian, the art collector. Charmed, I’m sure.

This repo is modified from glid-3-xl.

Checkpoints are fine-tuned from the glid-3-xl inpaint.pt checkpoint and can be downloaded manually here: ongo.pt.

Aesthetic CLIP embeds are provided by aesthetic-predictor.

Install

virtual environment:

python3 -m venv venv
source venv/bin/activate
(venv) $

pytorch

(venv) $ pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113
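
To confirm that the CUDA build of PyTorch installed correctly before moving on:

# should print the torch version and True if the GPU is visible
(venv) $ python -c "import torch; print(torch.__version__, torch.cuda.is_available())"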

latent-diffusion/taming-transformers


(venv) $ git clone https://github.com/CompVis/latent-diffusion.git
(venv) $ git clone https://github.com/CompVis/taming-transformers
(venv) $ pip install -e ./taming-transformers
(venv) $ pip install "omegaconf>=2.0.0" "pytorch-lightning>=1.0.8" torch-fidelity einops

ldm-finetune

(venv) $ git clone https://github.com/laion-ai/ldm-finetune.git
(venv) $ cd ldm-finetune
(venv) $ pip install -e .
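
The sampling and editing scripts referenced below live at the top level of the repo, so the remaining commands assume you stay in this directory. A quick check:

# both scripts should be listed
(venv) $ ls sample.py autoedit.py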

Download model files

# text encoder (required)
(venv) $ wget https://dall-3.com/models/glid-3-xl/bert.pt

# ldm first stage (required)
(venv) $ wget https://dall-3.com/models/glid-3-xl/kl-f8.pt

# there are several diffusion models to choose from:

# original diffusion model from CompVis
(venv) $ wget https://dall-3.com/models/glid-3-xl/diffusion.pt

# new model fine tuned on a cleaner dataset (will not generate watermarks, split images or blurry images)
(venv) $ wget https://dall-3.com/models/glid-3-xl/finetune.pt

# inpaint
(venv) $ wget https://dall-3.com/models/glid-3-xl/inpaint.pt

# erlich
(venv) $ wget https://huggingface.co/laion/erlich/resolve/main/model/ema_0.9999_120000.pt

# ongo - tbd
(venv) $ wget TODO
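
Before sampling, it is worth confirming that the required checkpoints actually downloaded in full (an interrupted wget leaves a truncated file behind). A quick check, assuming the files were saved to the current directory:

# bert.pt and kl-f8.pt are needed for every run, plus whichever diffusion checkpoint you chose
(venv) $ ls -lh bert.pt kl-f8.pt finetune.pt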

Generating images

note: best results at 256x256 image size

# fast PLMS sampling
python sample.py --model_path finetune.pt --batch_size 6 --num_batches 6 --text "a cyberpunk girl with a scifi neuralink device on her head"

# classifier free guidance + CLIP guidance (better adherence to prompt, much slower)
python sample.py --clip_guidance --model_path finetune.pt --batch_size 1 --num_batches 12 --text "a cyberpunk girl with a scifi neuralink device on her head | trending on artstation"

# sample with an init image
python sample.py --init_image picture.jpg --skip_timesteps 10 --model_path finetune.pt --batch_size 6 --num_batches 6 --text "a cyberpunk girl with a scifi neuralink device on her head"

# generated images saved to ./output/
# generated image embeddings saved to ./output_npy/ as npy files
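
To render several prompts back to back, the sampling command can be wrapped in a small shell loop. A sketch using only the flags shown above; since each run writes into ./output/, the results are moved aside between prompts (the results/ directory is just an example):

for prompt in "a lighthouse at dusk" "a watercolor fox in the snow"; do
    python sample.py --model_path finetune.pt --batch_size 6 --num_batches 1 --text "$prompt"
    # keep each prompt's images and embeddings separate
    mkdir -p "results/$prompt"
    mv output/* output_npy/* "results/$prompt/" 2>/dev/null
done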

Editing images

aka human guided diffusion. You can use inpainting to realize more complex prompts by progressively editing the image (see the loop sketch after the commands below).

note: you can use images larger than 256px, but the model only sees 256x256 at a time, so keep the inpainted area smaller than that

note: inpaint training is a work in progress


# install PyQt5 if you want to use a gui, otherwise supply a mask file
pip install PyQt5

# this will pop up a window, use your mouse to paint
# use the generated npy files instead of png for best quality
python sample.py --model_path inpaint.pt --edit output_npy/00000.npy --batch_size 6 --num_batches 6 --text "your prompt"

# after painting, the mask is saved for re-use
python sample.py --mask mask.png --model_path inpaint.pt --edit output_npy/00000.npy --batch_size 6 --num_batches 6 --text "your prompt"

# additional arguments for uncropping
python sample.py --edit_x 64 --edit_y 64 --edit_width 128 --edit_height 128 --model_path inpaint.pt --edit output_npy/00000.npy --batch_size 6 --num_batches 6 --text "your prompt"

# autoedit uses the inpaint model to give the ldm an image prompting function (that works differently from --init_image)
# it continuously edits random parts of the image to maximize clip score for the text prompt
python autoedit.py --edit image.png --model_path inpaint.pt --batch_size 6 --text "your prompt"
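
Progressive editing is usually done by hand, alternating between sampling and inpainting, but the basic rhythm can be scripted. A rough sketch that reuses only the flags shown above, assuming a mask file already exists (painted in the GUI or supplied manually) and that you pick the first candidate (00000.npy) after each pass:

# pass 1: generate a base image
python sample.py --model_path finetune.pt --batch_size 6 --num_batches 1 --text "a medieval town square"
# pass 2: inpaint a masked region of the chosen candidate with a new prompt
python sample.py --mask mask.png --model_path inpaint.pt --edit output_npy/00000.npy --batch_size 6 --num_batches 1 --text "a fountain with a bronze statue"
# repeat: feed the newest npy back into --edit with the next prompt and an updated mask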

Training/Fine tuning

Train with the same flags as guided diffusion. The data directory should contain image and text files with matching names (image1.png, image1.txt); see the sketch after the training command below for one way to lay this out.

# not currently possible to train on 24 GB of VRAM!
MODEL_FLAGS="--ema_rate 0.9999 --attention_resolutions 32,16,8 --class_cond False --diffusion_steps 1000 --image_size 32 --learn_sigma False --noise_schedule linear --num_channels 320 --num_heads 8 --num_res_blocks 2 --resblock_updown False --use_fp16 True --use_scale_shift_norm False"
TRAIN_FLAGS="--lr 1e-5 --batch_size 64 --microbatch 1 --log_interval 1 --save_interval 5000 --kl_model kl-f8.pt --bert_model bert.pt --resume_checkpoint diffusion.pt"
export OPENAI_LOGDIR=./logs/
export TOKENIZERS_PARALLELISM=false
python scripts/image_train_latent.py --data_dir /path/to/data $MODEL_FLAGS $TRAIN_FLAGS
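
One way to lay out the data directory is to copy each image and its caption side by side under matching names, as described above. A minimal sketch; the source locations (/path/to/images and /path/to/captions) are hypothetical:

# pair every image with a same-named caption file in the training directory
mkdir -p /path/to/data
for img in /path/to/images/*.png; do
    name=$(basename "$img" .png)
    cp "$img" "/path/to/data/$name.png"
    cp "/path/to/captions/$name.txt" "/path/to/data/$name.txt"
done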

Train for inpainting

# batch size > 1 required
MODEL_FLAGS="--dropout 0.1 --ema_rate 0.9999 --attention_resolutions 32,16,8 --class_cond False --diffusion_steps 1000 --image_size 32 --learn_sigma False --noise_schedule linear --num_channels 320 --num_heads 8 --num_res_blocks 2 --resblock_updown False --use_fp16 True --use_scale_shift_norm False"
TRAIN_FLAGS="--lr --batch_size 64 --microbatch 1 --log_interval 1 --save_interval 5000 --kl_model kl-f8.pt --bert_model bert.pt --resume_checkpoint diffusion.pt"
export OPENAI_LOGDIR=./logs/
export TOKENIZERS_PARALLELISM=false
python scripts/image_train_inpaint.py --data_dir /path/to/data $MODEL_FLAGS $TRAIN_FLAGS
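
Training progress is written to OPENAI_LOGDIR. Assuming the guided-diffusion style logger, which typically writes a progress.csv and log.txt there, you can follow the loss during a run with:

# file name assumes the guided-diffusion logger; adjust if your logs directory differs
tail -f ./logs/progress.csv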