ScaleCrafter: Tuning-free Higher-Resolution Visual Generation with Diffusion Models
The base model of the demo is stabilityai/stable-diffusion-xl-base-1.0
Input: “A beautiful girl on a boat”; Resolution: 2048 x 1152.
Input: “Miniature house with plants in the potted area, hyper realism, dramatic ambient lighting, high detail”; Resolution: 4096 x 4096.
Arbitrary higher-resolution generation based on SD 2.1.
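For orientation, here is a minimal sketch of how one might load the demo's base model with Hugging Face diffusers and request the first example's resolution. This is illustrative only, not the repo's actual demo code: plain SDXL at this size would show the repetition artifacts described below, and the real demo applies ScaleCrafter on top of the pipeline.

```python
# Illustrative sketch: load the demo's base model with diffusers and request
# a 2048 x 1152 image, matching the first example above. Without ScaleCrafter
# applied to the U-Net, this resolution typically yields object repetition.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

image = pipe(
    "A beautiful girl on a boat",
    height=1152,
    width=2048,
).images[0]
image.save("demo.png")
```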
🤗 TL;DR
ScaleCrafter can generate images at a resolution of 4096 x 4096 and videos at 2048 x 1152 from diffusion models pre-trained at much lower resolutions. Notably, our approach requires no extra training or optimization.
🔆 Abstract
In this work, we investigate the capability of generating images from pre-trained diffusion models at much higher resolutions than the training image sizes. In addition, the generated images should have arbitrary image aspect ratios. When generating images directly at a higher resolution, 1024 x 1024, with Stable Diffusion pre-trained on images of resolution 512 x 512, we observe persistent problems of object repetition and unreasonable object structures. Existing works for higher-resolution generation, such as attention-based and joint-diffusion approaches, cannot well address these issues. As a new perspective, we examine the structural components of the U-Net in diffusion models and identify the crucial cause as the limited perception field of convolutional kernels. Based on this key observation, we propose a simple yet effective re-dilation that can dynamically adjust the convolutional perception field during inference. We further propose dispersed convolution and noise-damped classifier-free guidance, which enable ultra-high-resolution image generation (e.g., 4096 x 4096). Notably, our approach does not require any training or optimization. Extensive experiments demonstrate that our approach addresses the repetition issue well and achieves state-of-the-art performance on higher-resolution image synthesis, especially in texture details. Our work also suggests that a diffusion model pre-trained on low-resolution images can be directly used for high-resolution visual generation without further tuning, which may provide insights for future research on ultra-high-resolution image and video synthesis.
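To make the re-dilation idea concrete, below is a minimal PyTorch sketch, under the assumption that re-dilation amounts to enlarging the dilation (and matching padding) of the U-Net's 3x3 convolutions at inference time. The helper name redilate_unet_convs is ours, not the repo's; the actual method additionally decides which layers and denoising steps to re-dilate, and swaps in dispersed convolution at ultra-high resolutions.

```python
# Minimal sketch of re-dilation (an assumption-labelled illustration, not the
# official implementation): enlarge the dilation of pre-trained 3x3 convs at
# inference so their perception field matches the higher target resolution,
# without any retraining of the weights.
import torch.nn as nn

def redilate_unet_convs(unet: nn.Module, dilation: int = 2) -> None:
    """Hypothetical helper: grow the perception field of every 3x3 conv."""
    for module in unet.modules():
        if isinstance(module, nn.Conv2d) and module.kernel_size == (3, 3):
            module.dilation = (dilation, dilation)
            # Grow padding by the same factor so the output spatial size
            # is preserved (for a 3x3 kernel with stride 1, padding = dilation).
            module.padding = (dilation, dilation)

# Usage sketch: re-dilate before sampling at roughly 2x the training
# resolution per axis, e.g. redilate_unet_convs(pipe.unet, dilation=2).
```

In the paper the adjustment can vary per layer and per denoising step; this sketch only shows the core mechanism of dynamically widening the perception field at inference.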
😉 Citation
@article{he2023scalecrafter,
  title={ScaleCrafter: Tuning-free Higher-Resolution Visual Generation with Diffusion Models},
  author={Yingqing He and Shaoshu Yang and Haoxin Chen and Xiaodong Cun and Menghan Xia and Yong Zhang and Xintao Wang and Ran He and Qifeng Chen and Ying Shan},
  year={2023},
  eprint={2310.07702},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
📭 Contact
If you have any comments or questions, feel free to contact Yingqing He or Shaoshu Yang.