ai-forever / kandinsky-2

text2img model trained on LAION HighRes and fine-tuned on internal datasets

  • Public
  • 6.2M runs
  • A100 (80GB)
  • GitHub
  • License

Input

  • string. Input Prompt. Default: "red cat, 4k photo"
  • integer (minimum: 1, maximum: 500). Number of denoising steps. Default: 50
  • number (minimum: 1, maximum: 20). Scale for classifier-free guidance. Default: 4
  • string. Choose a scheduler. Default: "p_sampler"
  • integer. Default: 4
  • string. Default: "5"
  • integer. Choose width. Lower the setting if out of memory. Default: 512
  • integer. Choose height. Lower the setting if out of memory. Default: 512
  • integer. Choose batch size. Lower the setting if out of memory. Default: 1
  • integer. Random seed. Leave blank to randomize the seed.
  • string. Format of the output images. Default: "webp"
  • integer (minimum: 0, maximum: 100). Quality of the output images, from 0 to 100; 100 is best quality, 0 is lowest quality. Default: 80
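
Below is a minimal sketch of calling this model through the Replicate Python client, using the defaults listed above. The input field names used here (prompt, num_inference_steps, guidance_scale, width, height) are assumptions, since the listing above shows only types, descriptions, and defaults; check the model's API schema for the exact names before running.

```python
# Minimal sketch of running the model via the Replicate Python client.
# Requires the REPLICATE_API_TOKEN environment variable to be set.
# The input field names below are assumptions based on the listing above;
# verify them against the model's API schema. You may also need to pin a
# specific version, e.g. "ai-forever/kandinsky-2:<version>".
import replicate

output = replicate.run(
    "ai-forever/kandinsky-2",
    input={
        "prompt": "red cat, 4k photo",   # assumed name for the Input Prompt field
        "num_inference_steps": 50,       # assumed name; denoising steps (1-500)
        "guidance_scale": 4,             # assumed name; classifier-free guidance (1-20)
        "width": 512,                    # assumed name; lower if out of memory
        "height": 512,                   # assumed name; lower if out of memory
    },
)
print(output)  # URL(s) of the generated image(s)
```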

Output


This example was created by a different version, ai-forever/kandinsky-2:9c0bf7d5.

Run time and cost

This model costs approximately $0.071 to run on Replicate, or 14 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.

This model runs on Nvidia A100 (80GB) GPU hardware. Predictions typically complete within 52 seconds. The predict time for this model varies significantly based on the inputs.
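
As a rough sanity check on the pricing above, at about $0.071 per prediction you get roughly 1 / 0.071 ≈ 14 runs per dollar. The short sketch below just does that arithmetic; the per-run price is taken from this page and actual cost varies with your inputs.

```python
# Rough cost estimate using the approximate per-run price quoted on this page.
# Actual cost varies, since predict time depends on the inputs.
PRICE_PER_RUN_USD = 0.071

def estimate_cost(num_runs: int) -> float:
    return num_runs * PRICE_PER_RUN_USD

print(round(1 / PRICE_PER_RUN_USD))  # ~14 runs per $1
print(estimate_cost(100))            # ~$7.10 for 100 runs
```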

Readme

Kandinsky 2.1

Model architecture:

Kandinsky 2.1 inherits best practices from DALL-E 2 and latent diffusion, while introducing some new ideas.

As text and image encoder it uses the CLIP model, together with a diffusion image prior that maps between the latent spaces of the CLIP modalities. This approach improves the visual performance of the model and opens up new possibilities for blending images and for text-guided image manipulation.

For the diffusion mapping between latent spaces we use a transformer with num_layers=20, num_heads=32, and hidden_size=2048.
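
To give a concrete sense of that scale, here is a minimal PyTorch sketch of a transformer stack with the stated hyperparameters (num_layers=20, num_heads=32, hidden_size=2048). It only illustrates the configuration; the actual Kandinsky 2.1 prior additionally conditions on CLIP text embeddings and diffusion timesteps, and the feed-forward width below is an assumption.

```python
# Illustrative only: a transformer stack with the hyperparameters quoted above.
# The real Kandinsky 2.1 image prior also conditions on CLIP text embeddings
# and diffusion timesteps; this is not its actual implementation.
import torch
import torch.nn as nn

hidden_size, num_heads, num_layers = 2048, 32, 20

encoder_layer = nn.TransformerEncoderLayer(
    d_model=hidden_size,
    nhead=num_heads,
    dim_feedforward=4 * hidden_size,  # assumed expansion ratio, not stated in the readme
    batch_first=True,
)
prior_stack = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

# Mapping between the CLIP latent spaces would also require input/output
# projections; here we only show the core stack.
x = torch.randn(1, 77, hidden_size)
print(prior_stack(x).shape)  # torch.Size([1, 77, 2048])
```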

Other architecture parts:

  • Text encoder (XLM-Roberta-Large-Vit-L-14): 560M
  • Diffusion Image Prior: 1B
  • CLIP image encoder (ViT-L/14): 427M
  • Latent Diffusion U-Net: 1.22B
  • MoVQ encoder/decoder: 67M
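
Summing the component sizes listed above gives a rough total parameter count for the full pipeline; the short sketch below just does that arithmetic, with the values taken from the list (in millions of parameters).

```python
# Rough total parameter count for the pipeline, summing the component sizes
# listed above (values in millions of parameters).
components = {
    "Text encoder (XLM-Roberta-Large-Vit-L-14)": 560,
    "Diffusion Image Prior": 1000,
    "CLIP image encoder (ViT-L/14)": 427,
    "Latent Diffusion U-Net": 1220,
    "MoVQ encoder/decoder": 67,
}
total_millions = sum(components.values())
print(f"~{total_millions / 1000:.2f}B parameters in total")  # ~3.27B
```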

Kandinsky 2.1 was trained on the large-scale image-text dataset LAION HighRes and fine-tuned on our internal datasets.