Run time and cost

This model runs on Nvidia T4 GPU hardware. Predictions typically complete within 4 minutes, but prediction time varies significantly with the inputs.


This is a Cog implementation of https://github.com/kakaobrain/minDALL-E

minDALL-E on Conceptual Captions

minDALL-E, named after minGPT, is a 1.3B-parameter text-to-image generation model trained on 14 million image-text pairs for non-commercial purposes.


  • The source code is licensed under the Apache 2.0 License.
  • The stage2 pretrained weights are licensed under the CC-BY-NC-SA 4.0 License.


We hope that minDALL-E helps various projects in research-oriented institutes and startups. If you would like to collaborate with us or share feedback, please e-mail us at contact@kakaobrain.com.


Although minDALL-E is trained on a small set (14M image-text pairs), it might be vulnerable to malicious prompt engineering that makes it generate socially unacceptable images. If you observe such images, please report the “prompt” and “generated images” to us.