Generate vivid images from any Chinese or English text
Predictions run on Nvidia A100 (40GB) GPU hardware. Predictions typically complete within 8 minutes. The predict time for this model varies significantly based on the inputs.

CogView2 is a hierarchical transformer (6B-9B-9B parameters) for general-domain text-to-image generation. This implementation is based on the SwissArmyTransformer library (v0.2).

Citation: Ding, Ming; Zheng, Wendi; Hong, Wenyi; Tang, Jie. "CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers." arXiv preprint arXiv:2204.14217.