cjwbw / wuerstchen

Efficient Pretraining of Text-to-Image Models

Public
4.2K runs
A100 (80GB)

Input

prompt

string

Input prompt

Default: "Anthropomorphic cat dressed as a firefighter"

negative_prompt

string

Specify things to not see in the output

width

integer

Width of output image.

Default: 1536

height

integer

Height of output image.

Default: 1024

num_images_per_prompt

integer

(minimum: 1, maximum: 4)

Number of images to output.

Default: 1

prior_num_inference_steps

integer

(minimum: 1, maximum: 500)

Number of prior denoising steps.

Default: 60

prior_guidance_scale

number

(minimum: 1, maximum: 20)

Scale for classifier-free guidance in prior.

Default: 4

decoder_num_inference_steps

integer

(minimum: 1, maximum: 500)

Number of decoder denoising steps.

Default: 12

decoder_guidance_scale

number

(minimum: 0, maximum: 20)

Scale for classifier-free guidance in decoder.

Default: 0

seed

integer

Random seed. Leave blank to randomize the seed
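The ranges and defaults listed above can be checked client-side before a request is sent. The following is a minimal sketch: the `WuerstchenInput` class and its `validate` and `to_payload` helpers are hypothetical conveniences, not part of the Replicate client, and the empty-string default for `negative_prompt` is an assumption taken from the API examples rather than from the schema.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class WuerstchenInput:
    """Hypothetical container mirroring the documented inputs and defaults."""

    prompt: str = "Anthropomorphic cat dressed as a firefighter"
    negative_prompt: str = ""  # assumption: the schema states no default
    width: int = 1536
    height: int = 1024
    num_images_per_prompt: int = 1
    prior_num_inference_steps: int = 60
    prior_guidance_scale: float = 4.0
    decoder_num_inference_steps: int = 12
    decoder_guidance_scale: float = 0.0
    seed: Optional[int] = None  # None: leave blank so the seed is randomized

    def validate(self) -> None:
        # (minimum, maximum) pairs as listed in the input schema.
        ranges = {
            "num_images_per_prompt": (1, 4),
            "prior_num_inference_steps": (1, 500),
            "prior_guidance_scale": (1, 20),
            "decoder_num_inference_steps": (1, 500),
            "decoder_guidance_scale": (0, 20),
        }
        for name, (low, high) in ranges.items():
            value = getattr(self, name)
            if not low <= value <= high:
                raise ValueError(f"{name}={value} is outside [{low}, {high}]")

    def to_payload(self) -> dict:
        """Return a dict suitable for the `input` field of a prediction."""
        self.validate()
        payload = asdict(self)
        if payload["seed"] is None:
            del payload["seed"]  # omit the key instead of sending null
        return payload
```

Calling `WuerstchenInput(num_images_per_prompt=2).to_payload()` returns a plain dictionary, while an out-of-range value such as `num_images_per_prompt=5` raises `ValueError` before any request is made.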

Run this model in Node.js with one line of code:

npx create-replicate --model=cjwbw/wuerstchen

or set up a project from scratch

Install Replicate’s Node.js client library:

npm install replicate

Set the REPLICATE_API_TOKEN environment variable:

export REPLICATE_API_TOKEN=<paste-your-token-here>

Find your API token in your account settings.

Import and set up the client:

import Replicate from "replicate";

const replicate = new Replicate({
  auth: process.env.REPLICATE_API_TOKEN,
});

Run cjwbw/wuerstchen using Replicate’s API. Check out the model's schema for an overview of inputs and outputs.

const output = await replicate.run(
  "cjwbw/wuerstchen:6a8baed32201ec714574e439aa57734acba760796731666bbd9470fefbd00039",
  {
    input: {
      width: 1536,
      height: 1536,
      prompt: "Anthropomorphic cat dressed as a firefighter",
      negative_prompt: "",
      prior_guidance_scale: 4,
      num_images_per_prompt: 2,
      decoder_guidance_scale: 0,
      prior_num_inference_steps: 30,
      decoder_num_inference_steps: 12
    }
  }
);

// To access the file URL:
console.log(output[0].url()); //=> "http://example.com"

// To write the file to disk:
import { writeFile } from "node:fs/promises";
await writeFile("my-image.png", output[0]);

To learn more, take a look at the guide on getting started with Node.js.

Install Replicate’s Python client library:

pip install replicate

Set the REPLICATE_API_TOKEN environment variable:

export REPLICATE_API_TOKEN=<paste-your-token-here>

Find your API token in your account settings.

Import the client:

import replicate

Run cjwbw/wuerstchen using Replicate’s API. Check out the model's schema for an overview of inputs and outputs.

output = replicate.run(
    "cjwbw/wuerstchen:6a8baed32201ec714574e439aa57734acba760796731666bbd9470fefbd00039",
    input={
        "width": 1536,
        "height": 1536,
        "prompt": "Anthropomorphic cat dressed as a firefighter",
        "negative_prompt": "",
        "prior_guidance_scale": 4,
        "num_images_per_prompt": 2,
        "decoder_guidance_scale": 0,
        "prior_num_inference_steps": 30,
        "decoder_num_inference_steps": 12
    }
)
print(output)

To learn more, take a look at the guide on getting started with Python.
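The Python snippet above only prints the output. A hedged sketch of saving each generated image to disk follows. The `save_outputs` helper is hypothetical, and it assumes that every output item is either a file-like object with a `read()` method (as returned by newer client versions) or a plain URL string (as in the JSON output shown on this page).

```python
from pathlib import Path
from urllib.request import urlopen


def save_outputs(outputs, directory="."):
    """Write each output item to <directory>/out-<index>.png and return the paths."""
    saved = []
    for index, item in enumerate(outputs):
        if hasattr(item, "read"):
            data = item.read()  # file-like output object
        else:
            with urlopen(item) as response:  # assumed to be a URL string
                data = response.read()
        path = Path(directory) / f"out-{index}.png"
        path.write_bytes(data)
        saved.append(path)
    return saved
```

With the example request above (`num_images_per_prompt: 2`), `save_outputs(output)` would be expected to write `out-0.png` and `out-1.png` to the current directory.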

Set the REPLICATE_API_TOKEN environment variable:

export REPLICATE_API_TOKEN=<paste-your-token-here>

Find your API token in your account settings.

Run cjwbw/wuerstchen using Replicate’s API. Check out the model's schema for an overview of inputs and outputs.

curl -s -X POST \
  -H "Authorization: Bearer $REPLICATE_API_TOKEN" \
  -H "Content-Type: application/json" \
  -H "Prefer: wait" \
  -d $'{
    "version": "cjwbw/wuerstchen:6a8baed32201ec714574e439aa57734acba760796731666bbd9470fefbd00039",
    "input": {
      "width": 1536,
      "height": 1536,
      "prompt": "Anthropomorphic cat dressed as a firefighter",
      "negative_prompt": "",
      "prior_guidance_scale": 4,
      "num_images_per_prompt": 2,
      "decoder_guidance_scale": 0,
      "prior_num_inference_steps": 30,
      "decoder_num_inference_steps": 12
    }
  }' \
  https://api.replicate.com/v1/predictions

To learn more, take a look at Replicate’s HTTP API reference docs.

Output

{
  "completed_at": "2023-09-16T02:02:55.991788Z",
  "created_at": "2023-09-16T02:02:45.457545Z",
  "data_removed": false,
  "error": null,
  "id": "lsu7r7bbpgptzr5fhoq5io4pw4",
  "input": {
    "width": 1536,
    "height": 1536,
    "prompt": "Anthropomorphic cat dressed as a firefighter",
    "negative_prompt": "",
    "prior_guidance_scale": 4,
    "num_images_per_prompt": 2,
    "decoder_guidance_scale": 0,
    "prior_num_inference_steps": 30,
    "decoder_num_inference_steps": 12
  },
  "logs": "Using seed: 14945\n  0%|          | 0/29 [00:00<?, ?it/s]\n  3%|▎         | 1/29 [00:00<00:03,  7.21it/s]\n  7%|▋         | 2/29 [00:00<00:03,  8.14it/s]\n 10%|█         | 3/29 [00:00<00:03,  8.62it/s]\n 14%|█▍        | 4/29 [00:00<00:02,  8.86it/s]\n 17%|█▋        | 5/29 [00:00<00:02,  8.99it/s]\n 21%|██        | 6/29 [00:00<00:02,  9.07it/s]\n 24%|██▍       | 7/29 [00:00<00:02,  9.10it/s]\n 28%|██▊       | 8/29 [00:00<00:02,  9.13it/s]\n 31%|███       | 9/29 [00:01<00:02,  9.16it/s]\n 34%|███▍      | 10/29 [00:01<00:02,  9.16it/s]\n 38%|███▊      | 11/29 [00:01<00:01,  9.16it/s]\n 41%|████▏     | 12/29 [00:01<00:01,  9.17it/s]\n 45%|████▍     | 13/29 [00:01<00:01,  9.18it/s]\n 48%|████▊     | 14/29 [00:01<00:01,  9.19it/s]\n 52%|█████▏    | 15/29 [00:01<00:01,  9.18it/s]\n 55%|█████▌    | 16/29 [00:01<00:01,  9.18it/s]\n 59%|█████▊    | 17/29 [00:01<00:01,  9.18it/s]\n 62%|██████▏   | 18/29 [00:01<00:01,  9.19it/s]\n 66%|██████▌   | 19/29 [00:02<00:01,  9.20it/s]\n 69%|██████▉   | 20/29 [00:02<00:00,  9.18it/s]\n 72%|███████▏  | 21/29 [00:02<00:00,  9.17it/s]\n 76%|███████▌  | 22/29 [00:02<00:00,  9.18it/s]\n 79%|███████▉  | 23/29 [00:02<00:00,  9.18it/s]\n 83%|████████▎ | 24/29 [00:02<00:00,  9.19it/s]\n 86%|████████▌ | 25/29 [00:02<00:00,  8.93it/s]\n 90%|████████▉ | 26/29 [00:02<00:00,  9.00it/s]\n 93%|█████████▎| 27/29 [00:02<00:00,  9.06it/s]\n 97%|█████████▋| 28/29 [00:03<00:00,  9.10it/s]\n100%|██████████| 29/29 [00:03<00:00,  9.13it/s]\n100%|██████████| 29/29 [00:03<00:00,  9.07it/s]\n  0%|          | 0/12 [00:00<?, ?it/s]\n  8%|▊         | 1/12 [00:00<00:02,  4.71it/s]\n 17%|█▋        | 2/12 [00:00<00:02,  4.79it/s]\n 25%|██▌       | 3/12 [00:00<00:01,  4.81it/s]\n 33%|███▎      | 4/12 [00:00<00:01,  4.83it/s]\n 42%|████▏     | 5/12 [00:01<00:01,  4.83it/s]\n 50%|█████     | 6/12 [00:01<00:01,  4.84it/s]\n 58%|█████▊    | 7/12 [00:01<00:01,  4.84it/s]\n 67%|██████▋   | 8/12 [00:01<00:00,  4.84it/s]\n 75%|███████▌  | 9/12 [00:01<00:00,  4.83it/s]\n 83%|████████▎ | 10/12 [00:02<00:00,  4.83it/s]\n 92%|█████████▏| 11/12 [00:02<00:00,  4.83it/s]\n100%|██████████| 12/12 [00:02<00:00,  4.84it/s]\n100%|██████████| 12/12 [00:02<00:00,  4.83it/s]",
  "metrics": {
    "predict_time": 10.565402,
    "total_time": 10.534243
  },
  "output": [
    "https://replicate.delivery/pbxt/oZfwnHbhc9SrIyb0GqPkeZ3lbnfYdxnzR0BTplDGOERd2XJjA/out-0.png",
    "https://replicate.delivery/pbxt/JYZEud12pT7FPFV2MtZaSTx6lEr2Z0XMpPb8JBUSYu0zeVyIA/out-1.png"
  ],
  "started_at": "2023-09-16T02:02:45.426386Z",
  "status": "succeeded",
  "urls": {
    "get": "https://api.replicate.com/v1/predictions/lsu7r7bbpgptzr5fhoq5io4pw4",
    "cancel": "https://api.replicate.com/v1/predictions/lsu7r7bbpgptzr5fhoq5io4pw4/cancel"
  },
  "version": "6a8baed32201ec714574e439aa57734acba760796731666bbd9470fefbd00039"
}
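The timestamps in this response can be used to recover the prediction time. The sketch below copies the relevant fields from the JSON above and subtracts `started_at` from `completed_at`. For this particular response the result matches `metrics.predict_time`, though whether that holds for every prediction is an assumption. The `parse_ts` helper is a hypothetical name.

```python
from datetime import datetime


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp with a trailing 'Z' (UTC)."""
    # Replacing "Z" keeps fromisoformat() usable on Python versions before 3.11.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


# Fields copied from the prediction response shown above.
prediction = {
    "started_at": "2023-09-16T02:02:45.426386Z",
    "completed_at": "2023-09-16T02:02:55.991788Z",
    "metrics": {"predict_time": 10.565402},
}

elapsed = (
    parse_ts(prediction["completed_at"]) - parse_ts(prediction["started_at"])
).total_seconds()

print(f"elapsed: {elapsed:.6f} s")
print(f"reported predict_time: {prediction['metrics']['predict_time']:.6f} s")
```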

Generated in 10.6 seconds

Run time and cost

This model costs approximately $0.025 to run on Replicate, or 40 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.
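As a rough budgeting aid, the quoted figure can be turned into a small calculation. This is only a sketch: it treats the approximate $0.025 per run as a fixed constant, whereas the actual cost varies with the inputs, and the function names are hypothetical.

```python
COST_PER_RUN_USD = 0.025  # approximate per-run figure quoted above


def runs_per_dollar(cost_per_run: float = COST_PER_RUN_USD) -> int:
    """Number of runs one dollar buys at the given per-run cost."""
    return round(1 / cost_per_run)


def estimated_cost(runs: int, cost_per_run: float = COST_PER_RUN_USD) -> float:
    """Estimated total cost in USD for a number of runs, rounded to cents."""
    return round(runs * cost_per_run, 2)
```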

This model runs on Nvidia A100 (80GB) GPU hardware. Predictions typically complete within 18 seconds.

Readme

Würstchen: Fast Diffusion for Image Generation

What is Würstchen?

Würstchen is a diffusion model whose text-conditional component works in a highly compressed latent space of images. Why is this important? Compressing data can reduce the computational cost of both training and inference by orders of magnitude: training on 1024×1024 images is far more expensive than training on 32×32. Other works typically use a relatively small compression, in the range of 4x to 8x spatial compression. Würstchen takes this to an extreme. Through its novel design, it achieves a 42x spatial compression! This had not been seen before, because common methods fail to faithfully reconstruct detailed images after 16x spatial compression.

Würstchen employs a two-stage compression, which we call Stage A and Stage B. Stage A is a VQGAN, and Stage B is a Diffusion Autoencoder (more details can be found in the paper). Together, Stages A and B are called the Decoder, because they decode the compressed images back into pixel space. A third model, Stage C, is learned in that highly compressed latent space. We refer to Stage C as the Prior. Training it requires a fraction of the compute used for current top-performing models, while also allowing cheaper and faster inference.
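To make the compression factors concrete, the sketch below compares the approximate latent side length of a 1024×1024 image at 4x, 8x, and 42x spatial compression. It simply divides the image side by the factor. How the actual model rounds or pads the latent size is not covered here, and `latent_side` is a hypothetical helper.

```python
def latent_side(image_side: int, factor: float) -> float:
    """Approximate side length of the latent grid for a square image."""
    return image_side / factor


for factor in (4, 8, 42):
    side = latent_side(1024, factor)
    # side * side is the number of spatial positions the model has to process.
    print(f"{factor:>2}x: ~{side:.1f} x {side:.1f} latent, ~{side * side:,.0f} positions")
```

At 42x the latent grid is roughly 24×24, so the Prior works on a few hundred spatial positions instead of the tens of thousands left after 4x or 8x compression.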

[Figure: Würstchen images with prompts]

Why another text-to-image model?

Well, this one is pretty fast and efficient. Würstchen’s biggest benefits come from the fact that it can generate images much faster than models like Stable Diffusion XL, while using a lot less memory! So for all of us who don’t have A100s lying around, this will come in handy. Here is a comparison with SDXL over different batch sizes:

[Figure: inference speed plots, Würstchen vs. SDXL over different batch sizes]

Another significant benefit of Würstchen is its reduced training cost. Würstchen v1, which works at 512x512, required only 9,000 GPU hours of training. Compared with the 150,000 GPU hours spent on Stable Diffusion 1.4, this is a roughly 16x reduction in cost, which not only benefits researchers conducting new experiments but also opens the door for more organizations to train such models. Würstchen v2 used 24,602 GPU hours. With resolutions going up to 1536, this is still about 6x cheaper than SD1.4, which was only trained at 512x512.
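The quoted reduction factors follow directly from the GPU-hour figures in this paragraph, as the short calculation below shows. The dictionary and function names are illustrative only.

```python
# GPU-hour figures as stated in the paragraph above.
GPU_HOURS = {
    "Stable Diffusion 1.4": 150_000,
    "Würstchen v1": 9_000,
    "Würstchen v2": 24_602,
}


def reduction_vs_sd14(model: str) -> float:
    """How many times fewer GPU hours a model used than Stable Diffusion 1.4."""
    return GPU_HOURS["Stable Diffusion 1.4"] / GPU_HOURS[model]


for name in ("Würstchen v1", "Würstchen v2"):
    print(f"{name}: {reduction_vs_sd14(name):.1f}x fewer GPU hours than SD 1.4")
```

The ratios come out at about 16.7 and 6.1, consistent with the rounded 16x and 6x figures in the text.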

What image sizes does Würstchen work on?

Würstchen was trained on image resolutions between 1024x1024 and 1536x1536. We sometimes also observe good outputs at resolutions like 1024x2048. Feel free to try it out. We also observed that the Prior (Stage C) adapts extremely fast to new resolutions, so fine-tuning it at 2048x2048 should be computationally cheap.