arielreplicate / paella_fast_text2image

Fast text2image model

Run time and cost

This model runs on NVIDIA T4 (High-memory) GPU hardware. Predictions typically complete within 5 seconds, but the prediction time varies significantly with the inputs.

Readme

Paella

Conditional text-to-image generation has seen countless recent improvements in quality, diversity, and fidelity. Nevertheless, most state-of-the-art models require numerous inference steps to produce faithful generations, which creates performance bottlenecks for end-user applications. In this paper we introduce Paella, a novel text-to-image model that requires fewer than 10 steps to sample high-fidelity images. Its speed-optimized architecture can sample a single image in less than 500 ms while having 573M parameters. The model operates on a compressed and quantized latent space, is conditioned on CLIP embeddings, and uses an improved sampling function over previous works. Aside from text-conditional image generation, our model can perform latent space interpolation and image manipulations such as inpainting, outpainting, and structural editing.
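
To make the idea of few-step sampling in a quantized latent space more concrete, here is a toy sketch in plain Python. It is not the authors' sampling function: the linear re-noising schedule, the function name `sample_tokens`, and the `predict` callable (a stand-in for the trained network) are all assumptions made for illustration. The sketch starts from random codebook indices, lets the stand-in model propose tokens for every position, and re-randomizes a shrinking fraction of positions at each step.

```python
import random


def sample_tokens(predict, num_tokens, codebook_size, steps=8, seed=0):
    """Toy iterative sampler over discrete latent tokens (illustrative only)."""
    rng = random.Random(seed)
    # Start from pure noise: every latent position holds a random codebook index.
    tokens = [rng.randrange(codebook_size) for _ in range(num_tokens)]
    for step in range(steps):
        # The (hypothetical) model proposes a token for every position at once.
        tokens = list(predict(tokens))
        # Assumed schedule: the re-noised fraction shrinks linearly to 0 at the last step.
        noise_fraction = 1.0 - (step + 1) / steps
        num_noisy = int(noise_fraction * num_tokens)
        for i in rng.sample(range(num_tokens), num_noisy):
            tokens[i] = rng.randrange(codebook_size)
    return tokens


# Hypothetical stand-in for the trained network: it always predicts a fixed target grid.
target = [i % 16 for i in range(64)]
result = sample_tokens(lambda toks: target, num_tokens=64, codebook_size=16)
```

In a real system the returned token grid would be passed to the decoder of the quantized autoencoder to produce pixels, and `predict` would be the text-conditioned network rather than a constant function.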

Full details about the model and how it was trained are available in our preprint on arXiv.


License

The model code and weights are released under the MIT license.