Feed-forward VQGAN-CLIP model, whose goal is to eliminate the per-prompt optimization of the VQGAN latent space that standard VQGAN-CLIP requires. Instead, a model is trained that takes a text prompt as input and outputs a VQGAN latent, which is then decoded into an RGB image. The model is trained on a dataset of text prompts and can then be used on unseen prompts. The training loss minimizes the distance between the CLIP features of the generated image and the CLIP features of the input text. Additionally, a diversity loss can be used to increase the diversity of the images generated for the same prompt; a sketch of the objective is shown below.
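
To make the objective concrete, here is a minimal sketch of the training loss as described above; it is not the repository's actual code. The names `text_to_latent`, `vqgan`, and `clip_model` are placeholders for the trained prompt-to-latent model, the frozen VQGAN, and the frozen CLIP encoder, and the image preprocessing constants are the standard CLIP normalization values.

```python
# Minimal sketch of the feed-forward VQGAN-CLIP training loss (hypothetical names,
# not the repository's exact implementation).
import torch
import torch.nn.functional as F

# Standard CLIP image normalization constants.
CLIP_MEAN = torch.tensor([0.48145466, 0.4578275, 0.40821073]).view(1, 3, 1, 1)
CLIP_STD = torch.tensor([0.26862954, 0.26130258, 0.27577711]).view(1, 3, 1, 1)

def clip_guidance_loss(text_to_latent, vqgan, clip_model, text_features):
    """Loss for one batch of CLIP text embeddings of shape (B, D)."""
    # Predict VQGAN latents directly from the prompts' CLIP text embeddings.
    latents = text_to_latent(text_features)                  # e.g. (B, C, H, W)

    # Decode latents into RGB images with the frozen VQGAN decoder;
    # assume the decoder outputs values in [-1, 1] and map them to [0, 1].
    images = (vqgan.decode(latents).clamp(-1, 1) + 1) / 2    # (B, 3, H_img, W_img)

    # Resize and normalize for CLIP's image encoder, then embed the images.
    images = F.interpolate(images, size=(224, 224), mode="bilinear", align_corners=False)
    images = (images - CLIP_MEAN.to(images)) / CLIP_STD.to(images)
    image_features = F.normalize(clip_model.encode_image(images), dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Minimize the distance between CLIP image and text features
    # (equivalently, maximize their cosine similarity).
    return (1.0 - (image_features * text_features).sum(dim=-1)).mean()
```

A diversity term can be added on top of this, for example by conditioning the model on a noise vector and penalizing the similarity between images generated from the same prompt with different noise; the exact form used in the repository is not detailed here.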
Acknowledgements
- The training code is heavily based on the VQGAN-CLIP notebook https://colab.research.google.com/drive/1ZAus_gn2RhTZWzOWUpPERNC0Q8OhZRTZ, thanks to all the authors who contributed to the notebook (@crowsonkb, @advadnoun, @Eleiber, @Crimeacs, @Abulafia)
- Thanks to @lucidrains, the MLP mixer model (`mlp_mixer_pytorch.py`) is from https://github.com/lucidrains/mlp-mixer-pytorch
- Thanks to CompVis for Taming Transformers https://github.com/CompVis/taming-transformers; the code uses the VQGAN pre-trained model and the VGG16 feature space perceptual loss https://github.com/CompVis/taming-transformers/blob/master/taming/modules/losses/lpips.py
- Thanks to @afiaka87 for all the contributions to the repository’s code and for providing the blog captions dataset for experimentation
- Thanks to the ViTGAN authors; the ViTGAN model is from https://github.com/wilile26811249/ViTGAN
- Thanks to the Replicate team, especially @chenxwh and @andreasjansson, for making and hosting a browser-based text-to-image interface using the model and for all the support
- Thanks to the authors of CLOOB for the code and the pre-trained models
- Thanks to @crowsonkb; the code/models for CLOOB pre-trained on LAION-400M are based on cloob-training
- Thanks to OpenCLIP authors for CLIP-like code/models pre-trained on LAION-400M and LAION-2B
- Thanks to CompVis’s Net2Net (https://github.com/CompVis/net2net), which was used to train text-to-image embedding priors
- Models were trained on the JURECA-DC supercomputer at the Jülich Supercomputing Centre (JSC); many thanks for the compute provided to train the models.