A 1.4B-parameter text-to-image model from CompVis, fine-tuned on CLIP text embeddings and curated data.

Run time and cost

Predictions run on Nvidia T4 GPU hardware. Predictions typically complete within 13 minutes. The predict time for this model varies significantly based on the inputs.



GLID-3-xl is the 1.4B-parameter latent diffusion model from CompVis, back-ported to the guided diffusion codebase.

The model has been split into three checkpoints: the text encoder, the LDM first stage, and the diffusion model. This lets us fine-tune the diffusion model on new datasets and for additional tasks like inpainting and super-resolution.

Download model files

# text encoder (required)
wget https://dall-3.com/models/glid-3-xl/bert.pt

# ldm first stage (required)
wget https://dall-3.com/models/glid-3-xl/kl-f8.pt

# there are several diffusion models to choose from:

# original diffusion model from CompVis
wget https://dall-3.com/models/glid-3-xl/diffusion.pt

# new model fine-tuned on a cleaner dataset (will not generate watermarks, split images or blurry images)
wget https://dall-3.com/models/glid-3-xl/finetune.pt

# inpaint
wget https://dall-3.com/models/glid-3-xl/inpaint.pt
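
Since only one diffusion checkpoint is needed at a time, the downloads above can be wrapped in a small helper. The following is a minimal sketch, not part of the project: the function name `model_urls` and the variable `BASE_URL` are made up for illustration, and it assumes the files stay at the URLs listed above. It only prints the three URLs for a chosen variant; the download itself is left as a commented usage line.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of GLID-3-xl): list the checkpoint URLs
# needed for one diffusion variant. Assumes the URLs from the list above.
BASE_URL="https://dall-3.com/models/glid-3-xl"

model_urls() {
  # variant is one of: diffusion, finetune, inpaint (defaults to finetune)
  local variant="${1:-finetune}"
  case "$variant" in
    diffusion|finetune|inpaint) ;;
    *) echo "unknown variant: $variant" >&2; return 1 ;;
  esac
  # The text encoder and first stage are always required;
  # the third file is the chosen diffusion checkpoint.
  printf '%s\n' \
    "$BASE_URL/bert.pt" \
    "$BASE_URL/kl-f8.pt" \
    "$BASE_URL/$variant.pt"
}

# Usage (uncomment to download; -c resumes partial files):
# model_urls finetune | xargs -n 1 wget -c
```

Passing `inpaint` or `diffusion` instead of `finetune` selects one of the other diffusion checkpoints; any other name prints an error and returns a non-zero status.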