moayedhajiali / elasticdiffusion

ElasticDiffusion: Training-free Arbitrary Size Image Generation


Run time and cost

This model runs on Nvidia A40 (Large) GPU hardware. Predictions typically complete within 9 minutes. The predict time for this model varies significantly based on the inputs.

Readme

ElasticDiffusion: Training-free Arbitrary Size Image Generation (arXiv 2023)

Abstract: Diffusion models have revolutionized image generation in recent years, yet they are still limited to a few sizes and aspect ratios. We propose ElasticDiffusion, a novel training-free decoding method that enables pretrained text-to-image diffusion models to generate images with various sizes. ElasticDiffusion attempts to decouple the generation trajectory of a pretrained model into local and global signals. The local signal controls low-level pixel information and can be estimated on local patches, while the global signal is used to maintain overall structural consistency and is estimated with a reference image. We test our method on CelebA-HQ (faces) and LAION-COCO (objects/indoor/outdoor scenes). Our experiments and qualitative results show superior image coherence quality across aspect ratios compared to MultiDiffusion and the standard decoding strategy of Stable Diffusion. For more details, please visit our project webpage or read our paper.
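
The decoupling described in the abstract can be sketched in a few lines. The outline below is illustrative only, not the repository's implementation: `denoise` is a stand-in for the pretrained model's noise prediction, the helper names and patch sizes are hypothetical, and the two signals are simply summed for brevity.

```python
# Illustrative sketch of ElasticDiffusion's local/global decoupling (not the
# repository's actual code). `denoise` stands in for a pretrained diffusion
# model that only accepts its training resolution.
import torch
import torch.nn.functional as F

def denoise(latent: torch.Tensor, t: int) -> torch.Tensor:
    return torch.randn_like(latent)  # placeholder noise prediction

def tile_positions(size: int, patch: int, stride: int) -> list[int]:
    # Assumes size >= patch; append a final tile so coverage reaches the edge.
    pos = list(range(0, size - patch + 1, stride))
    if pos[-1] != size - patch:
        pos.append(size - patch)
    return pos

def estimate_local_signal(latent, t, patch=64, overlap=16):
    # Local signal: run the model on overlapping training-size patches and
    # average the overlaps; this controls low-level pixel information.
    out = torch.zeros_like(latent)
    count = torch.zeros_like(latent)
    _, _, h, w = latent.shape
    for y in tile_positions(h, patch, patch - overlap):
        for x in tile_positions(w, patch, patch - overlap):
            tile = latent[:, :, y:y + patch, x:x + patch]
            out[:, :, y:y + patch, x:x + patch] += denoise(tile, t)
            count[:, :, y:y + patch, x:x + patch] += 1
    return out / count

def estimate_global_signal(latent, t, train_size=64):
    # Global signal: estimate overall structure on a training-size reference
    # view, then upsample the estimate back to the target size.
    ref = F.interpolate(latent, size=(train_size, train_size), mode="bilinear")
    return F.interpolate(denoise(ref, t), size=latent.shape[-2:], mode="bilinear")

# One decoding step at an arbitrary (non-square) target size.
latent = torch.randn(1, 4, 64, 96)
eps = estimate_local_signal(latent, t=500) + estimate_global_signal(latent, t=500)
```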

Replicate Demo

You can try text-to-image generation, using Stable Diffusion XL as the base model, through this Replicate Demo; a sketch of a programmatic call is shown after the notes below.

  • Please use the recommended hyper-parameters for each target resolution, as indicated in the provided examples, and follow our hyper-parameters guide below.
  • The current implementation is restricted to 2x the training resolution (i.e., up to 2048x2048 for Stable Diffusion XL).
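
Below is a minimal sketch of calling the model programmatically with the Replicate Python client. The input field names other than the hyper-parameters documented in the next section (e.g. `prompt`, `img_width`, `img_height`) are assumptions; consult the model's API tab on Replicate for the authoritative input schema and version hash.

```python
# Hypothetical invocation sketch; requires REPLICATE_API_TOKEN in the
# environment. Field names besides the documented hyper-parameters are
# assumptions, not the confirmed schema.
import replicate

output = replicate.run(
    "moayedhajiali/elasticdiffusion",  # optionally pin ":<version-hash>"
    input={
        "prompt": "a panoramic photo of a mountain lake at sunrise",  # assumed name
        "img_width": 2048,   # assumed name; at most 2x the SDXL training resolution
        "img_height": 1024,  # assumed name
        "resampling_steps": 7,    # documented hyper-parameter (see below)
        "new_p": 0.3,             # documented recommendation
        "rrg_init_weight": 1000,  # illustrative value
        "cosine_scale": 3,        # illustrative value
    },
)
print(output)  # URL(s) of the generated image(s)
```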


Hyper-parameters

  • resampling_steps: Controls the number of resampling steps used to increase the resolution of the global content. A higher value typically results in sharper images but increases inference time substantially.
  • new_p: Controls the percentage of pixels sampled at each resampling step. A lower value increases the resolution of the global content at a higher rate but may introduce artifacts. We recommend setting new_p to 0.3.
  • rrg_init_weight: The initial scale of the reduced-resolution guidance. A higher value helps eliminate emerging artifacts but results in blurrier images.
  • cosine_scale: Specifies how quickly the reduced-resolution guidance scale decreases; a higher value leads to a more rapid decrease. Together with rrg_init_weight, this hyper-parameter controls the sharpness-artifacts tradeoff; see the sketch after this list.
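
To make the interaction between rrg_init_weight and cosine_scale concrete, the snippet below prints one plausible cosine-decay schedule. This is a sketch of the described behavior, not necessarily the exact schedule used in the implementation.

```python
# Plausible reduced-resolution-guidance schedule: starts at rrg_init_weight
# and decays to zero, with cosine_scale controlling how fast it falls.
# Not necessarily the implementation's exact formula.
import math

def rrg_weight(step, total_steps, rrg_init_weight, cosine_scale):
    progress = step / total_steps                   # 0 at the first step, 1 at the last
    decay = (1 + math.cos(math.pi * progress)) / 2  # falls from 1 to 0
    return rrg_init_weight * decay ** cosine_scale  # higher scale -> faster decrease

for step in (0, 10, 25, 40, 50):
    print(step, round(rrg_weight(step, 50, rrg_init_weight=1000, cosine_scale=3), 1))
```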

Citation

If you find this paper useful in your research, please consider citing:

@misc{hajiali2023elasticdiffusion,
    title={ElasticDiffusion: Training-free Arbitrary Size Image Generation}, 
    author={Moayed Haji-Ali and Guha Balakrishnan and Vicente Ordonez},
    year={2023},
    eprint={2311.18822},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}