daanelson / mixture-of-diffusers

Generate an image by specifying a different text prompt for each region

Run time and cost

This model runs on Nvidia A100 (40GB) GPU hardware. Predictions typically complete within 36 seconds. The predict time for this model varies significantly based on the inputs.

Readme

Model by Álvaro Barbero Jiménez.

Model Information

Current image generation methods, such as Stable Diffusion, struggle to position objects at specific locations. While the content of the generated image (somewhat) reflects the objects present in the prompt, it is difficult to phrase the prompt in a way that produces a specific composition. For instance, take a prompt expressing a complex composition such as:

A charming house in the countryside on the left, in the center a dirt road in the countryside crossing pastures, on the right an old and rusty giant robot lying on a dirt road, by jakub rozalski, sunset lighting on the left and center, dark sunset lighting on the right elegant, highly detailed, smooth, sharp focus, artstation, stunning masterpiece

Out of a sample of 20 Stable Diffusion generations with different seeds, the generated images that align best with the prompt are the following:

The method proposed here strives to provide a better tool for image composition by using several diffusion processes in parallel, each configured with a specific prompt and settings, and focused on a particular region of the image. For example, the following are three outputs from this method, using the following prompts from left to right:

  • A charming house in the countryside, by jakub rozalski, sunset lighting, elegant, highly detailed, smooth, sharp focus, artstation, stunning masterpiece
  • A dirt road in the countryside crossing pastures, by jakub rozalski, sunset lighting, elegant, highly detailed, smooth, sharp focus, artstation, stunning masterpiece
  • An old and rusty giant robot lying on a dirt road, by jakub rozalski, dark sunset lighting, elegant, highly detailed, smooth, sharp focus, artstation, stunning masterpiece

[Figure: three sample outputs generated with the prompts above, using seeds 9764851938, 2096547054, and 7178915308 with 50 steps each.]

The mixture of diffusion processes is done in a way that harmonizes the generation process, preventing “seam” effects in the generated image.
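The text does not spell out how the regions are harmonized. One way to picture it is a weighted average over the overlap, where each region contributes most near its own center and least near its edges. The sketch below illustrates that idea on a single row of values. It is an assumption for illustration only, not necessarily the scheme the model uses; `tile_weights`, `blend_row`, and the `sigma_fraction` parameter are hypothetical names.

```python
import math
from typing import List, Tuple


def tile_weights(length: int, sigma_fraction: float = 0.15) -> List[float]:
    """Bell-shaped weights that peak at the tile center (illustrative assumption)."""
    center = (length - 1) / 2
    sigma = length * sigma_fraction
    return [math.exp(-((i - center) ** 2) / (2 * sigma ** 2)) for i in range(length)]


def blend_row(width: int, tiles: List[Tuple[int, List[float]]]) -> List[float]:
    """Blend 1D tiles, given as (x0, values), into one row by weighted averaging."""
    total = [0.0] * width
    weight_sum = [0.0] * width
    for x0, values in tiles:
        for offset, (value, weight) in enumerate(zip(values, tile_weights(len(values)))):
            total[x0 + offset] += weight * value
            weight_sum[x0 + offset] += weight
    # Columns that no tile covers have no defined value.
    return [t / w if w > 0 else float("nan") for t, w in zip(total, weight_sum)]


# Two tiles of width 8 that overlap on columns 4..7:
# the left tile predicts 0.0 everywhere, the right tile predicts 1.0 everywhere.
row = blend_row(12, [(0, [0.0] * 8), (4, [1.0] * 8)])
print([round(v, 3) for v in row])
```

In the overlap the blended values rise gradually from the left tile's value to the right tile's value instead of switching abruptly at a single column, which is the kind of transition that avoids a visible seam.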

Using several diffusion processes in parallel also has practical advantages when generating very large images, as the GPU memory requirements are similar to those of generating an image the size of a single tile.

Usage

First, specify a height and width in pixels for your canvas. Then, divide the canvas into multiple rectangular regions. Each region is a box defined by its upper-left corner (y0, x0) (or (row0, col0), if you prefer) and its lower-right corner (y1, x1). Regions can overlap to blend images between regions.

For example, the demo image above specifies three vertical regions which are as tall as the canvas and overlap horizontally, defined as follows:

  • region_1: (0, 0), (640, 640)
  • region_2: (0, 384), (640, 1024)
  • region_3: (0, 768), (640, 1408)

These regions are passed into the model like so:

  • y0_values = 0;0;0
  • x0_values = 0;384;768
  • y1_values = 640;640;640
  • x1_values = 640;1024;1408
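Writing the four semicolon-separated strings by hand gets error-prone as the number of regions grows. A minimal sketch of a helper that builds them from a list of boxes follows. The helper name `encode_regions` and the `Box` alias are hypothetical and not part of the model; only the four input names come from the description above.

```python
from typing import Dict, List, Tuple

# A region box as (y0, x0, y1, x1): upper-left corner, then lower-right corner.
Box = Tuple[int, int, int, int]


def encode_regions(boxes: List[Box]) -> Dict[str, str]:
    """Turn a list of boxes into the four semicolon-separated input strings."""
    y0s, x0s, y1s, x1s = zip(*boxes)

    def join(values) -> str:
        return ";".join(str(v) for v in values)

    return {
        "y0_values": join(y0s),
        "x0_values": join(x0s),
        "y1_values": join(y1s),
        "x1_values": join(x1s),
    }


# The three overlapping vertical regions from the demo image.
demo_boxes = [(0, 0, 640, 640), (0, 384, 640, 1024), (0, 768, 640, 1408)]
demo_inputs = encode_regions(demo_boxes)
for name, value in demo_inputs.items():
    print(f"{name} = {value}")
```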

Finally, pass in a series of prompts, one for each region, again separated by semicolons. Keep in mind that canvas regions not covered by prompts generate noise, so please cover them all.
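Since uncovered parts of the canvas come out as noise, it can help to check coverage before submitting a prediction. The sketch below lists any rectangular gaps left by a set of boxes. It assumes each box covers rows y0 up to but not including y1, and columns x0 up to but not including x1; that half-open convention and the function name `uncovered_cells` are assumptions for illustration.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (y0, x0, y1, x1)


def uncovered_cells(height: int, width: int, boxes: List[Box]) -> List[Box]:
    """Return the rectangular gaps of the canvas that no box covers."""
    # Box edges (clamped to the canvas) split the canvas into a coarse grid.
    # Each grid cell is either fully inside or fully outside every box.
    ys = sorted({0, height, *(min(max(v, 0), height) for b in boxes for v in (b[0], b[2]))})
    xs = sorted({0, width, *(min(max(v, 0), width) for b in boxes for v in (b[1], b[3]))})
    gaps = []
    for ya, yb in zip(ys, ys[1:]):
        for xa, xb in zip(xs, xs[1:]):
            covered = any(
                y0 <= ya and yb <= y1 and x0 <= xa and xb <= x1
                for y0, x0, y1, x1 in boxes
            )
            if not covered:
                gaps.append((ya, xa, yb, xb))
    return gaps


# The demo regions cover the full 640 x 1408 canvas.
demo_boxes = [(0, 0, 640, 640), (0, 384, 640, 1024), (0, 768, 640, 1408)]
print(uncovered_cells(640, 1408, demo_boxes))

# Two boxes that leave a vertical strip (columns 40..49) uncovered.
print(uncovered_cells(50, 100, [(0, 0, 50, 40), (0, 50, 50, 100)]))
```

An empty list means every pixel falls inside at least one region.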

A Quick Example

The inputs below will create an image 50 pixels tall and 100 pixels wide, with a dog in the region specified by the box (0, 0, 50, 50) and a cat in the region specified by the box (0, 50, 50, 100). Boxes are written here as (y0, x0, y1, x1).

canvas_height=50
canvas_width=100
prompts='dog;cat'
y0_values='0;0'
x0_values='0;50'
y1_values='50;50'
x1_values='50;100'
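To make the pairing between prompts and boxes explicit, here is a minimal sketch that parses inputs in this format and validates them. It is not the model's own parsing code: `parse_inputs` is a hypothetical helper, and the rules it enforces (one box per prompt, every box inside the canvas, empty entries dropped) are assumptions about what a sensible input looks like.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (y0, x0, y1, x1)


def parse_inputs(
    prompts: str,
    y0_values: str,
    x0_values: str,
    y1_values: str,
    x1_values: str,
    canvas_height: int,
    canvas_width: int,
) -> List[Tuple[str, Box]]:
    """Pair each prompt with its box and reject inconsistent inputs."""
    # Empty entries (for example after a trailing semicolon) are dropped.
    prompt_list = [p.strip() for p in prompts.split(";") if p.strip()]
    coords = [
        [int(v) for v in s.split(";") if v.strip()]
        for s in (y0_values, x0_values, y1_values, x1_values)
    ]
    if any(len(c) != len(prompt_list) for c in coords):
        raise ValueError("each prompt needs exactly one y0, x0, y1, and x1 value")

    regions = []
    for prompt, y0, x0, y1, x1 in zip(prompt_list, *coords):
        if not (0 <= y0 < y1 <= canvas_height and 0 <= x0 < x1 <= canvas_width):
            raise ValueError(f"box for prompt {prompt!r} falls outside the canvas")
        regions.append((prompt, (y0, x0, y1, x1)))
    return regions


regions = parse_inputs(
    prompts="dog;cat",
    y0_values="0;0",
    x0_values="0;50",
    y1_values="50;50",
    x1_values="50;100",
    canvas_height=50,
    canvas_width=100,
)
for prompt, box in regions:
    print(prompt, box)
```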

Citation

@misc{https://doi.org/10.48550/arxiv.2302.02412,
  doi = {10.48550/ARXIV.2302.02412},
  url = {https://arxiv.org/abs/2302.02412},
  author = {Jiménez, Álvaro Barbero},
  keywords = {Computer Vision and Pattern Recognition (cs.CV), Artificial Intelligence (cs.AI), Machine Learning (cs.LG), FOS: Computer and information sciences, I.2.6},
  title = {Mixture of Diffusers for scene composition and high resolution image generation},
  publisher = {arXiv},
  year = {2023},
  copyright = {Creative Commons Attribution 4.0 International}
}