Text-to-image synthesis using contrastive learning



Run time and cost

This model runs on CPU hardware. Predictions typically complete within 11 seconds. The predict time for this model varies significantly based on the inputs.


The goal of text-to-image synthesis is to generate a visually realistic image that matches a given text description. In practice, the captions that humans write for the same image vary widely in content and word choice. This linguistic discrepancy between captions of the same image causes the synthetic images to deviate from the ground truth. To address this issue, we propose a contrastive learning approach that improves the quality and semantic consistency of synthetic images. In the pre-training stage, we use contrastive learning to learn consistent textual representations for the captions that correspond to the same image. In the subsequent GAN training stage, we use contrastive learning again to enhance the consistency between the images generated from captions of the same image.
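To make the pre-training idea concrete, here is a minimal sketch of a contrastive objective over caption embeddings. It assumes an NT-Xent (normalized temperature-scaled cross-entropy) loss, which is one common formulation; the text above does not state the exact loss, and the function names, the temperature value, and the toy embeddings are all illustrative assumptions rather than the authors' implementation. Two captions of the same image form a positive pair, and every other caption in the batch acts as a negative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent_loss(emb_a, emb_b, temperature=0.5):
    """NT-Xent loss over paired caption embeddings (illustrative sketch).

    emb_a[i] and emb_b[i] are lists of floats: embeddings of two different
    captions that describe the same image i. Both arguments must be Python
    lists of the same length. The temperature default is an assumption.
    """
    n = len(emb_a)
    z = emb_a + emb_b  # concatenate into 2n embeddings
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)  # the other caption of the same image
        # Similarities to every embedding except the anchor itself.
        logits = [cosine(z[i], z[k]) / temperature
                  for k in range(2 * n) if k != i]
        pos_logit = cosine(z[i], z[pos]) / temperature
        # Numerically stable log-sum-exp for the denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - pos_logit
    return total / (2 * n)


# Toy 2-D embeddings: captions of image 0 point along x, image 1 along y.
captions_a = [[1.0, 0.0], [0.0, 1.0]]
captions_b = [[0.9, 0.1], [0.1, 0.9]]

aligned = nt_xent_loss(captions_a, captions_b)
# Reversing one side pairs each caption with the wrong image.
misaligned = nt_xent_loss(captions_a, captions_b[::-1])
print(aligned < misaligned)
```

Minimizing this quantity pulls embeddings of captions for the same image together and pushes apart those for different images. The same pairing scheme could in principle be applied to features of generated images in the GAN stage, though the details there are not given in the description above.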

This is a demo of this approach, using AttnGAN and the bird dataset. Try, for example, “this bird has wings that are red and has a yellow belly”.


If you find this work useful in your research, please consider citing:

@article{ye2021improving,
  title={Improving Text-to-Image Synthesis Using Contrastive Learning},
  author={Ye, Hui and Yang, Xiulong and Takac, Martin and Sunderraman, Rajshekhar and Ji, Shihao},
  journal={arXiv preprint arXiv:2107.02423},
  year={2021}
}


Our work is based on the following works:

  • AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks
  • DM-GAN: Dynamic Memory Generative Adversarial Networks for Text-to-Image Synthesis