ai-forever / kandinsky-2

text2img model trained on LAION HighRes and fine-tuned on internal datasets

  • Public
  • 6.2M runs
  • A100 (80GB)
  • GitHub
  • License

Input

  • string. Input Prompt. Default: "red cat, 4k photo"
  • integer (minimum: 1, maximum: 500). Number of denoising steps. Default: 50
  • number (minimum: 1, maximum: 20). Scale for classifier-free guidance. Default: 4
  • string. Choose a scheduler. Default: "p_sampler"
  • integer. Default: 4
  • string. Default: "5"
  • integer. Choose width. Lower the setting if out of memory. Default: 512
  • integer. Choose height. Lower the setting if out of memory. Default: 512
  • integer. Choose batch size. Lower the setting if out of memory. Default: 1
  • integer. Random seed. Leave blank to randomize the seed.
  • string. Format of the output images. Default: "webp"
  • integer (minimum: 0, maximum: 100). Quality of the output images, from 0 to 100; 100 is best quality, 0 is lowest quality. Default: 80
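
Below is a minimal sketch of calling this model through the Replicate Python client, using the defaults listed above. The input field names used here (prompt, num_inference_steps, guidance_scale, width, height) are assumptions, since the listing above shows only types, descriptions, and defaults; check the model's API schema for the exact names before running.

```python
# Minimal sketch of running the model via the Replicate Python client.
# Requires the REPLICATE_API_TOKEN environment variable to be set.
# The input field names below are assumptions based on the listing above;
# verify them against the model's API schema. You may also need to pin a
# specific version, e.g. "ai-forever/kandinsky-2:<version>".
import replicate

output = replicate.run(
    "ai-forever/kandinsky-2",
    input={
        "prompt": "red cat, 4k photo",   # assumed name for the Input Prompt field
        "num_inference_steps": 50,       # assumed name; denoising steps (1-500)
        "guidance_scale": 4,             # assumed name; classifier-free guidance (1-20)
        "width": 512,                    # assumed name; lower if out of memory
        "height": 512,                   # assumed name; lower if out of memory
    },
)
print(output)  # URL(s) of the generated image(s)
```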

Output


This example was created by a different version, ai-forever/kandinsky-2:9c0bf7d5.

Run time and cost

This model costs approximately $0.071 to run on Replicate, or 14 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.

This model runs on Nvidia A100 (80GB) GPU hardware. Predictions typically complete within 52 seconds. The predict time for this model varies significantly based on the inputs.
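
As a rough sanity check on the pricing above, at about $0.071 per prediction you get roughly 1 / 0.071 ≈ 14 runs per dollar. The short sketch below just does that arithmetic; the per-run price is taken from this page and actual cost varies with your inputs.

```python
# Rough cost estimate using the approximate per-run price quoted on this page.
# Actual cost varies, since predict time depends on the inputs.
PRICE_PER_RUN_USD = 0.071

def estimate_cost(num_runs: int) -> float:
    return num_runs * PRICE_PER_RUN_USD

print(round(1 / PRICE_PER_RUN_USD))  # ~14 runs per $1
print(estimate_cost(100))            # ~$7.10 for 100 runs
```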

Readme

Kandinsky 2.1

Model architecture:

Kandinsky 2.1 inherits best practices from DALL-E 2 and latent diffusion, while introducing some new ideas.

As text and image encoder it uses the CLIP model, together with a diffusion image prior that maps between the latent spaces of the CLIP modalities. This approach improves the visual performance of the model and opens up new possibilities for blending images and for text-guided image manipulation.

For the diffusion mapping between latent spaces we use a transformer with num_layers=20, num_heads=32, and hidden_size=2048.
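
To give a concrete sense of that scale, here is a minimal PyTorch sketch of a transformer stack with the stated hyperparameters (num_layers=20, num_heads=32, hidden_size=2048). It only illustrates the configuration; the actual Kandinsky 2.1 prior additionally conditions on CLIP text embeddings and diffusion timesteps, and the feed-forward width below is an assumption.

```python
# Illustrative only: a transformer stack with the hyperparameters quoted above.
# The real Kandinsky 2.1 image prior also conditions on CLIP text embeddings
# and diffusion timesteps; this is not its actual implementation.
import torch
import torch.nn as nn

hidden_size, num_heads, num_layers = 2048, 32, 20

encoder_layer = nn.TransformerEncoderLayer(
    d_model=hidden_size,
    nhead=num_heads,
    dim_feedforward=4 * hidden_size,  # assumed expansion ratio, not stated in the readme
    batch_first=True,
)
prior_stack = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

# Mapping between the CLIP latent spaces would also require input/output
# projections; here we only show the core stack.
x = torch.randn(1, 77, hidden_size)
print(prior_stack(x).shape)  # torch.Size([1, 77, 2048])
```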

Other architecture parts:

  • Text encoder (XLM-Roberta-Large-Vit-L-14): 560M
  • Diffusion Image Prior: 1B
  • CLIP image encoder (ViT-L/14): 427M
  • Latent Diffusion U-Net: 1.22B
  • MoVQ encoder/decoder: 67M
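
Summing the component sizes listed above gives a rough total parameter count for the full pipeline; the short sketch below just does that arithmetic, with the values taken from the list (in millions of parameters).

```python
# Rough total parameter count for the pipeline, summing the component sizes
# listed above (values in millions of parameters).
components = {
    "Text encoder (XLM-Roberta-Large-Vit-L-14)": 560,
    "Diffusion Image Prior": 1000,
    "CLIP image encoder (ViT-L/14)": 427,
    "Latent Diffusion U-Net": 1220,
    "MoVQ encoder/decoder": 67,
}
total_millions = sum(components.values())
print(f"~{total_millions / 1000:.2f}B parameters in total")  # ~3.27B
```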

Kandinsky 2.1 was trained on the large-scale image-text dataset LAION HighRes and fine-tuned on our internal datasets.