
arielreplicate/paella_fast_image_interpolation

Fast image interpolation model
128 runs

Run time and cost

Predictions run on NVIDIA T4 GPU hardware and typically complete within 27 seconds, although the prediction time varies significantly with the inputs.


Paella

Conditional text-to-image generation has seen countless recent improvements in quality, diversity and fidelity. Nevertheless, most state-of-the-art models require numerous inference steps to produce faithful generations, which creates performance bottlenecks for end-user applications. In this paper we introduce Paella, a novel text-to-image model that needs fewer than 10 steps to sample high-fidelity images. Its speed-optimized architecture, with 573M parameters, can sample a single image in less than 500 ms. The model operates on a compressed and quantized latent space, is conditioned on CLIP embeddings, and uses a sampling function that improves on previous work. Aside from text-conditional image generation, the model can perform latent space interpolation and image manipulations such as inpainting, outpainting and structural editing.
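Latent space interpolation is mentioned above without detail. The sketch below illustrates the general idea under a stated assumption: that an interpolation is produced by blending two conditioning vectors (for instance the CLIP embeddings of two prompts) and sampling one image per intermediate vector. This is not taken from the Paella code; the preprint describes the actual procedure. The helper names `lerp`, `slerp` and `interpolation_path` are hypothetical, and the 2-D vectors are toy stand-ins for real embeddings.

```python
import math
from typing import List, Sequence

Vector = List[float]


def lerp(a: Sequence[float], b: Sequence[float], t: float) -> Vector:
    """Linear interpolation: (1 - t) * a + t * b."""
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]


def slerp(a: Sequence[float], b: Sequence[float], t: float, eps: float = 1e-7) -> Vector:
    """Spherical linear interpolation along the arc between a and b.

    Falls back to lerp when a and b are (nearly) parallel or antiparallel,
    where the spherical weights become numerically unstable.
    """
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cos_omega = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    cos_omega = max(-1.0, min(1.0, cos_omega))  # guard against rounding error
    omega = math.acos(cos_omega)                # angle between the two vectors
    sin_omega = math.sin(omega)
    if sin_omega < eps:
        return lerp(a, b, t)
    w_a = math.sin((1.0 - t) * omega) / sin_omega
    w_b = math.sin(t * omega) / sin_omega
    return [w_a * x + w_b * y for x, y in zip(a, b)]


def interpolation_path(a: Sequence[float], b: Sequence[float],
                       steps: int, spherical: bool = True) -> List[Vector]:
    """Return `steps` vectors evenly spaced in t from a (t=0) to b (t=1).

    Hypothetical helper: each returned vector would be passed to the
    sampler as the conditioning for one frame of the interpolation.
    """
    if steps < 2:
        raise ValueError("steps must be at least 2")
    fn = slerp if spherical else lerp
    return [fn(a, b, i / (steps - 1)) for i in range(steps)]


if __name__ == "__main__":
    path = interpolation_path([1.0, 0.0], [0.0, 1.0], steps=3)
    print(path[1])  # midpoint lies on the unit circle, about [0.7071, 0.7071]
```

Spherical interpolation is shown as the default because it keeps the intermediate vectors at a norm comparable to the endpoints, whereas a plain linear blend shrinks toward the origin (the linear midpoint above would be `[0.5, 0.5]`). Whether this matters for Paella's CLIP conditioning is an assumption, not something the source states.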


All details about the model and how it was trained can be found in our preprint on arXiv.


License

The model code and weights are released under the MIT license.