Fast, minimal port of DALL·E Mini to PyTorch

Run time and cost

Predictions run on Nvidia A100 GPU hardware. Predictions typically complete within 29 seconds.




Input Parameter Descriptions


  • text: For long prompts, only the first 64 tokens will be used to generate the image.
  • save_as_png: If selected, the image is saved in lossless PNG format; otherwise it is saved as JPG.
  • progressive_outputs: Show intermediate outputs while running. This adds less than a second to the run time.
  • seamless: Tile images in token space instead of pixel space. This has the effect of blending the images at the borders.
  • grid_size: Size of the image grid. 5x5 takes about 15 seconds, 9x9 takes about 40 seconds.


  • temperature: A higher temperature increases the probability of sampling low-scoring image tokens.
  • top_k: Each image token is sampled from the top-k scoring tokens.

Increasing temperature and/or top_k increases the variety of the generated images at the expense of coherence. Setting temperature high and top_k low can give more variety without sacrificing coherence, because sampling stays restricted to the few highest-scoring tokens while the choice among them becomes more even.
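The interaction between temperature and top_k described above can be illustrated with a minimal sketch. This is not the model's own code: the function name `sample_top_k` is hypothetical, and plain Python lists stand in for the PyTorch tensors the port would actually use. It assumes the common formulation in which logits are restricted to the k highest, divided by the temperature, and passed through a softmax before sampling.

```python
import math
import random


def sample_top_k(logits, temperature=1.0, top_k=256, rng=random):
    """Sample one token index from the top_k highest-scoring logits.

    Hypothetical helper for illustration only.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Clamp k to the valid range [1, len(logits)].
    k = max(1, min(top_k, len(logits)))
    # Indices of the k highest-scoring tokens; everything else is excluded.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # A higher temperature flattens the distribution over the kept tokens.
    scaled = [logits[i] / temperature for i in top]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


if __name__ == "__main__":
    example_logits = [0.1, 2.0, -1.0, 1.5]
    # High temperature with a small top_k: only indices 1 and 3 can appear,
    # but they are chosen with nearly equal probability.
    print([sample_top_k(example_logits, temperature=4.0, top_k=2) for _ in range(10)])
```

With top_k = 1 the sketch always returns the highest-scoring token regardless of temperature, which is why a low top_k limits how incoherent a high temperature can make the output.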


  • supercondition_factor: Higher values can result in better agreement with the text. Let logits_cond be the logits computed from the text prompt, let logits_uncond be the logits computed from an empty text prompt, and let a be the super-condition factor. Then logits = logits_cond * a + logits_uncond * (1 - a)
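The super-condition formula can be written out directly. The sketch below is illustrative only: the function name `supercondition` is hypothetical, and plain Python lists stand in for logit tensors. It applies the stated formula element by element.

```python
def supercondition(logits_cond, logits_uncond, a):
    """Blend conditional and unconditional logits with super-condition factor a.

    Hypothetical helper implementing:
        logits = logits_cond * a + logits_uncond * (1 - a)
    """
    if len(logits_cond) != len(logits_uncond):
        raise ValueError("logit vectors must have the same length")
    return [c * a + u * (1 - a) for c, u in zip(logits_cond, logits_uncond)]


if __name__ == "__main__":
    cond = [1.0, 2.0]
    uncond = [0.5, 2.0]
    # a = 1 reproduces the conditional logits unchanged.
    print(supercondition(cond, uncond, 1))
    # a = 4 pushes each logit further in the direction cond - uncond.
    print(supercondition(cond, uncond, 4))
```

The formula is algebraically the same as logits_uncond + a * (logits_cond - logits_uncond), so a value of a above 1 amplifies whatever the text prompt changed relative to the empty prompt, and tokens on which the two agree are left as they were.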


Consider the images generated for "panda with top hat reading a book" with different settings.

text = "panda with top hat reading a book"
temperature = 0.5
top_k = 128
supercondition_factor = 4


text = "panda with top hat reading a book"
temperature = 4
top_k = 64
supercondition_factor = 16