mehdidc / feed_forward_vqgan_clip

Feed forward VQGAN-CLIP model

Run time and cost

This model costs approximately $0.00027 to run on Replicate, or 3703 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.

This model runs on Nvidia T4 (High-memory) GPU hardware. Predictions typically complete within 2 seconds.
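If you want to call the hosted model programmatically, the snippet below is a minimal sketch using the Replicate Python client. The input field name (`prompt` here) and the form of the output are assumptions about this model's interface, not confirmed by this page, so check the model's input schema on Replicate before relying on them.

```python
import replicate

# Requires the REPLICATE_API_TOKEN environment variable to be set.
# "prompt" is assumed to be this model's text input field; adjust it to
# match the input schema shown on the model page.
output = replicate.run(
    "mehdidc/feed_forward_vqgan_clip",
    input={"prompt": "an oil painting of a lighthouse at dusk"},
)
print(output)  # typically a URL (or list of URLs) for the generated image
```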

Readme

Feed forward VQGAN-CLIP model, where the goal is to eliminate the need to optimize VQGAN's latent space separately for each input prompt. This is done by training a model that takes a text prompt as input and returns the corresponding VQGAN latent code, which is then decoded into an RGB image. The model is trained on a dataset of text prompts and can be used on unseen text prompts. The loss function minimizes the distance between the CLIP features of the generated image and the CLIP features of the input text. Additionally, a diversity loss can be used to increase the diversity of the images generated for the same prompt.
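The objective described above can be sketched in a few lines of PyTorch. Everything below (`TextToLatent`, `training_step`, the `vqgan_decode` and `clip_image_features` callables, and the latent-space diversity term) is a hypothetical stand-in rather than the repository's actual classes or API; the point is only to show how the CLIP feature distance and the optional diversity term fit together.

```python
import torch
import torch.nn.functional as F

CLIP_DIM, LATENT_DIM = 512, 256  # illustrative sizes, not the repo's actual dimensions


class TextToLatent(torch.nn.Module):
    """Toy feed-forward mapper from CLIP text features to a VQGAN latent code."""

    def __init__(self):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(CLIP_DIM, 1024), torch.nn.GELU(),
            torch.nn.Linear(1024, LATENT_DIM),
        )

    def forward(self, text_feats, noise):
        # Noise lets the model produce several different images per prompt,
        # which the diversity term below then encourages to differ.
        return self.net(text_feats) + noise


def training_step(model, text_feats, clip_image_features, vqgan_decode,
                  div_weight=0.1, samples_per_prompt=2):
    """One step of the feed-forward objective: pull the CLIP features of the
    generated image toward the CLIP text features, and optionally push
    samples of the same prompt apart."""
    rep = text_feats.repeat_interleave(samples_per_prompt, dim=0)
    noise = torch.randn(rep.shape[0], LATENT_DIM)
    latents = model(rep, noise)
    images = vqgan_decode(latents)            # frozen VQGAN decoder: latent -> RGB
    img_feats = clip_image_features(images)   # frozen CLIP image encoder
    # Main loss: cosine distance between CLIP image features and text features.
    clip_loss = (1 - F.cosine_similarity(img_feats, rep, dim=-1)).mean()
    # Diversity loss (assumed variant, computed on latents): negative mean
    # pairwise distance between samples generated from the same prompt.
    if samples_per_prompt > 1:
        lat = latents.view(-1, samples_per_prompt, LATENT_DIM)
        div_loss = -torch.cdist(lat, lat).mean()
    else:
        div_loss = latents.new_zeros(())
    return clip_loss + div_weight * div_loss


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end without CLIP or VQGAN installed.
    fake_decode = lambda z: z @ torch.randn(LATENT_DIM, 3 * 16 * 16)
    fake_clip_image = lambda imgs: imgs @ torch.randn(3 * 16 * 16, CLIP_DIM)
    model = TextToLatent()
    text_feats = torch.randn(4, CLIP_DIM)     # pretend CLIP text features
    loss = training_step(model, text_feats, fake_clip_image, fake_decode)
    loss.backward()
    print(float(loss))
```

In the real model, CLIP and VQGAN stay frozen and only the text-to-latent network is trained, which is what makes generation a single forward pass at inference time instead of a per-prompt optimization.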

Acknowledgements