chenxwh / cogview3

Finer and Faster Text-to-Image Generation via Relay Diffusion

  • Public
  • 44 runs
  • L40S
  • GitHub
  • Weights
  • Paper
  • License

Input

  • string: Input prompt. Default: "a photo of an astronaut riding a horse on mars"
  • string: Things you do not want to see in the output (negative prompt). Default: ""
  • integer: Width of the output image. Maximum size is 1024x768 or 768x1024 because of memory limits. Default: 1024
  • integer: Height of the output image. Maximum size is 1024x768 or 768x1024 because of memory limits. Default: 1024
  • integer (minimum: 1, maximum: 500): Number of denoising steps. Default: 50
  • number (minimum: 1, maximum: 20): Scale for classifier-free guidance. Default: 7
  • integer: Random seed. Leave blank to randomize the seed.
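
A minimal sketch of calling this model through the Replicate Python client is shown below. The input field names (prompt, negative_prompt, width, height, num_inference_steps, guidance_scale, seed) are assumptions inferred from the descriptions above; the model's API page has the authoritative schema and version identifier.

import replicate

output = replicate.run(
    "chenxwh/cogview3",  # optionally pin a specific version hash
    input={
        "prompt": "a photo of an astronaut riding a horse on mars",
        "negative_prompt": "",
        "width": 1024,
        "height": 1024,
        "num_inference_steps": 50,
        "guidance_scale": 7,
        # "seed": 1234,  # omit to randomize the seed
    },
)
print(output)  # URL(s) of the generated image(s)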

Output

Run time and cost

This model runs on Nvidia L40S GPU hardware. We don't yet have enough runs of this model to provide performance information.

Readme

CogView3 & CogView-3Plus

Model Introduction

CogView-3-Plus builds on CogView3 (ECCV'24) by adopting the latest DiT (Diffusion Transformer) framework for further overall performance improvements. It uses Zero-SNR diffusion noise scheduling and a joint text-image attention mechanism, which reduces training and inference costs compared with the commonly used MMDiT structure while preserving the model's core capabilities. CogView-3-Plus uses a VAE with a latent dimension of 16.
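
Zero-SNR scheduling forces the final diffusion timestep to carry zero signal, so the model is trained and sampled from truly pure noise at the last step. Below is a minimal PyTorch sketch of the standard zero-terminal-SNR rescaling (following Lin et al., "Common Diffusion Noise Schedules and Sample Steps are Flawed"); it illustrates the technique only and is not CogView-3-Plus's exact implementation.

import torch

def rescale_zero_terminal_snr(betas: torch.Tensor) -> torch.Tensor:
    """Rescale a beta schedule so the final timestep has zero SNR (pure noise)."""
    alphas_bar_sqrt = (1.0 - betas).cumprod(dim=0).sqrt()

    first, last = alphas_bar_sqrt[0].clone(), alphas_bar_sqrt[-1].clone()
    alphas_bar_sqrt = alphas_bar_sqrt - last                    # shift so the last value is 0
    alphas_bar_sqrt = alphas_bar_sqrt * first / (first - last)  # rescale so the first value is unchanged

    alphas_bar = alphas_bar_sqrt ** 2
    alphas = torch.cat([alphas_bar[:1], alphas_bar[1:] / alphas_bar[:-1]])
    return 1.0 - alphas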

Citation

🌟 If you find our work helpful, feel free to cite our paper and leave a star.

@article{zheng2024cogview3,
  title={Cogview3: Finer and faster text-to-image generation via relay diffusion},
  author={Zheng, Wendi and Teng, Jiayan and Yang, Zhuoyi and Wang, Weihan and Chen, Jidong and Gu, Xiaotao and Dong, Yuxiao and Ding, Ming and Tang, Jie},
  journal={arXiv preprint arXiv:2403.05121},
  year={2024}
}