Fast, minimal port of DALL·E Mini to PyTorch

Run time and cost

Predictions run on Nvidia A100 GPU hardware and typically complete within 28 seconds.


Input Parameter Descriptions


  • text: The text prompt. For long prompts, only the first 64 tokens are used to generate the image.
  • save_as_png: If selected, the image is saved in lossless PNG format; otherwise it is saved as JPEG.
  • progressive_outputs: Show intermediate outputs while running. This adds less than a second to the run time.
  • seamless: Tile the images in token space instead of pixel space. This blends neighboring images at their borders.
  • grid_size: Number of images along each side of the grid. A 5x5 grid takes about 15 seconds; a 9x9 grid takes about 40 seconds.
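The difference between pixel-space and token-space tiling can be sketched as the same block-arrangement step applied at different stages. This sketch assumes, as in DALL·E Mini's VQGAN, that each image is a 16x16 grid of tokens that decodes to 256x256 pixels; `tile_blocks` is a hypothetical helper, not part of the model's API. By default, each image's tokens would be decoded separately and the 256x256 pixel blocks tiled afterwards. With `seamless`, the 16x16 token blocks are tiled first into one large token grid that is decoded as a single image, so the decoder's convolutions can mix information across the borders.

```python
def tile_blocks(blocks, grid_size):
    """Arrange grid_size * grid_size square 2-D blocks (row-major order) into one 2-D grid.

    Hypothetical helper: the same arrangement applies to 16x16 token blocks
    (seamless mode) or 256x256 pixel blocks (default); only the block side changes.
    """
    if len(blocks) != grid_size * grid_size:
        raise ValueError("expected grid_size ** 2 blocks")
    side = len(blocks[0])  # side length of each square block
    n = side * grid_size   # side length of the full grid
    out = [[None] * n for _ in range(n)]
    for idx, block in enumerate(blocks):
        grid_row, grid_col = divmod(idx, grid_size)  # where this block sits in the grid
        for r in range(side):
            for c in range(side):
                out[grid_row * side + r][grid_col * side + c] = block[r][c]
    return out


# Usage: a 2x2 grid of 2x2 "token" blocks, each filled with its image index.
blocks = [[[i] * 2 for _ in range(2)] for i in range(4)]
for row in tile_blocks(blocks, 2):
    print(row)
```

Under the same 16-token / 256-pixel assumption, a grid_size of 5 gives 25 images: in seamless mode the decoder would see one 80x80 token grid (5 × 16), and in either mode the final image is 1280x1280 pixels (5 × 256).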


  • temperature: A higher temperature increases the probability of sampling low-scoring image tokens.
  • top_k: Each image token is sampled from the top_k highest-scoring tokens.

Increasing temperature and/or top_k increases the variety of the generated images at the expense of coherence. Setting temperature high and top_k low can increase variety without sacrificing coherence.
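A minimal sketch of how temperature and top_k interact when sampling one image token, based on the descriptions above: keep the top_k highest-scoring tokens, then sample from a softmax of their logits divided by the temperature. This is plain Python rather than the port's PyTorch code, and `token_probs`, `sample_token`, and `entropy` are hypothetical names. The entropy of the resulting distribution is used here only as a rough proxy for variety.

```python
import math
import random
from typing import List, Sequence


def token_probs(logits: Sequence[float], temperature: float, top_k: int) -> List[float]:
    """Zero out all but the top_k logits, then apply a temperature-scaled softmax."""
    kth_largest = sorted(logits, reverse=True)[top_k - 1]
    max_logit = max(logits)  # subtracted for numerical stability
    weights = [
        # Ties with the k-th largest logit are kept, so slightly more than k may survive.
        math.exp((x - max_logit) / temperature) if x >= kth_largest else 0.0
        for x in logits
    ]
    total = sum(weights)
    return [w / total for w in weights]


def sample_token(logits: Sequence[float], temperature: float, top_k: int, rng=random) -> int:
    """Draw one token index from the filtered, temperature-scaled distribution."""
    probs = token_probs(logits, temperature, top_k)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; higher means probability is spread over more tokens."""
    return -sum(p * math.log(p) for p in probs if p > 0)


logits = [4.0, 3.0, 2.0, 1.0, 0.0, -1.0, -2.0, -3.0]
for temperature, top_k in [(0.5, 8), (4.0, 8), (4.0, 2)]:
    probs = token_probs(logits, temperature, top_k)
    print(temperature, top_k, round(entropy(probs), 3), sample_token(logits, temperature, top_k))
```

Because top_k removes every token outside the k best, the entropy can never exceed ln(k) however high the temperature goes. That is one way to read the advice to pair a high temperature with a low top_k: the temperature flattens the distribution, while top_k keeps it confined to plausible tokens.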


  • supercondition_factor: Higher values can result in better agreement with the text. Let logits_cond be the logits computed from the text prompt, logits_uncond the logits computed from an empty text prompt, and a the super-condition factor. Then:

      logits = logits_cond * a + logits_uncond * (1 - a)
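The formula above can be rewritten as logits_uncond + a * (logits_cond - logits_uncond): with a = 1 only the text-conditioned logits are used, and a > 1 pushes the logits further in the direction the prompt moves them (the same idea as classifier-free guidance). A minimal element-wise sketch in plain Python; the function name `supercondition` is hypothetical, and the port itself operates on tensors:

```python
from typing import List, Sequence


def supercondition(logits_cond: Sequence[float], logits_uncond: Sequence[float], a: float) -> List[float]:
    """Element-wise logits_cond * a + logits_uncond * (1 - a)."""
    return [c * a + u * (1 - a) for c, u in zip(logits_cond, logits_uncond)]


cond = [2.0, 1.0, 0.0]    # logits from the text prompt
uncond = [1.0, 1.0, 1.0]  # logits from an empty prompt
print(supercondition(cond, uncond, 4))  # [5.0, 1.0, -3.0]
```

Tokens the prompt favors over the empty prompt (here index 0) are boosted, and tokens it disfavors (index 2) are pushed down, by an amount that grows with a.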


Consider the images generated for the prompt "panda with top hat reading a book" under two different settings:

text = "panda with top hat reading a book"
temperature = 0.5
top_k = 128
supercondition_factor = 4


text = "panda with top hat reading a book"
temperature = 4
top_k = 64
supercondition_factor = 16