arielreplicate / paella_fast_image_variation

Fast image variation model

  • Public
  • 675 runs



Run time and cost

This model runs on Nvidia T4 (High-memory) GPU hardware. Predictions typically complete within 96 seconds. The predict time for this model varies significantly based on the inputs.




Conditional text-to-image generation has seen countless recent improvements in quality, diversity, and fidelity. Nevertheless, most state-of-the-art models require numerous inference steps to produce faithful generations, resulting in performance bottlenecks for end-user applications. In this paper we introduce Paella, a novel text-to-image model that requires fewer than 10 steps to sample high-fidelity images. Its speed-optimized architecture can sample a single image in less than 500 ms while having 573M parameters. The model operates on a compressed and quantized latent space, is conditioned on CLIP embeddings, and uses an improved sampling function over previous works. Aside from text-conditional image generation, our model can perform latent space interpolation and image manipulations such as inpainting, outpainting, and structural editing.
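
To make the idea of interpolating between conditioning embeddings concrete, here is a minimal sketch of spherical linear interpolation (slerp) between two vectors. It is an illustration only: the function name `slerp` and the toy 3-dimensional vectors are assumptions made for this example, and the exact interpolation scheme Paella uses is not stated on this page. Real CLIP embeddings are much larger than the stand-ins below.

```python
import math

def slerp(a, b, t):
    """Spherically interpolate between vectors a and b at fraction t in [0, 1].

    Hypothetical helper for illustration; not taken from the Paella code base.
    Both inputs are normalized to unit length before interpolating.
    """
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    ua = [x / norm_a for x in a]
    ub = [x / norm_b for x in b]
    # Clamp the cosine to [-1, 1] to guard against rounding error.
    dot = max(-1.0, min(1.0, sum(x * y for x, y in zip(ua, ub))))
    omega = math.acos(dot)
    s = math.sin(omega)
    if s < 1e-6:
        # Nearly parallel or antiparallel vectors: fall back to linear interpolation.
        return [(1 - t) * x + t * y for x, y in zip(ua, ub)]
    wa = math.sin((1 - t) * omega) / s
    wb = math.sin(t * omega) / s
    return [wa * x + wb * y for x, y in zip(ua, ub)]

# Toy stand-ins for two conditioning embeddings (assumed values).
emb_a = [1.0, 0.0, 0.0]
emb_b = [0.0, 1.0, 0.0]
midpoint = slerp(emb_a, emb_b, 0.5)
```

Sweeping `t` from 0 to 1 and conditioning the generator on each intermediate vector would give a sequence of images that move gradually from one condition to the other. Slerp keeps the interpolated vector at unit length, which is one common reason to prefer it over plain linear interpolation for normalized embeddings.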


Full details about the model and how it was trained are in our preprint on arXiv.


The model code and weights are released under the MIT license.