lucataco / cogview4-6b

CogView4-6B is a 6B-parameter text-to-image model that supports native Chinese input and Chinese text-to-image generation.


Input

  • string (required): Text prompt to generate an image from.
  • string: Negative prompt to guide image generation away from certain concepts.
  • integer (minimum: 512, maximum: 2048, default: 1024): Width of the generated image. Must be divisible by 32.
  • integer (minimum: 512, maximum: 2048, default: 1024): Height of the generated image. Must be divisible by 32.
  • integer (minimum: 1, maximum: 100, default: 50): Number of denoising steps.
  • number (minimum: 0, maximum: 20, default: 3.5): Guidance scale for classifier-free guidance.
  • integer: Random seed for reproducible image generation.
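The inputs above map naturally onto a call through the Replicate Python client. The following is a minimal sketch rather than the model's documented schema: the input keys (`prompt`, `negative_prompt`, `width`, `height`, `num_inference_steps`, `guidance_scale`, `seed`) are assumed names, since this page lists only types and descriptions, and `build_input` and `generate` are hypothetical helpers. It also assumes the `replicate` package is installed and `REPLICATE_API_TOKEN` is set in the environment.

```python
from typing import Any, Dict, Optional


def build_input(
    prompt: str,
    negative_prompt: str = "",
    width: int = 1024,
    height: int = 1024,
    num_inference_steps: int = 50,
    guidance_scale: float = 3.5,
    seed: Optional[int] = None,
) -> Dict[str, Any]:
    """Assemble the input payload. Key names are assumptions, not a documented schema."""
    if not prompt:
        raise ValueError("prompt is required")
    payload: Dict[str, Any] = {
        "prompt": prompt,
        "width": width,
        "height": height,
        "num_inference_steps": num_inference_steps,
        "guidance_scale": guidance_scale,
    }
    # Optional fields are only sent when the caller actually set them.
    if negative_prompt:
        payload["negative_prompt"] = negative_prompt
    if seed is not None:
        payload["seed"] = seed
    return payload


def generate(prompt: str, model_ref: str = "lucataco/cogview4-6b", **kwargs: Any) -> Any:
    """Run the model on Replicate. A version suffix (owner/name:version) may be required."""
    import replicate  # imported lazily so build_input can be used without the client

    return replicate.run(model_ref, input=build_input(prompt, **kwargs))
```

Passing a fixed `seed` is what makes repeated runs comparable; leaving it unset lets the service choose one.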


Run time and cost

This model runs on Nvidia L40S GPU hardware. We don't yet have enough runs of this model to provide performance information.

Readme

CogView4-6B

Inference Requirements and Model Introduction

  • Resolution: Width and height must each be between 512px and 2048px and divisible by 32, and the total number of pixels must not exceed 2^21.
  • Precision: BF16 / FP32 (FP16 is not supported because it overflows and produces completely black images)
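The resolution rule can be checked before submitting a job. The sketch below encodes the three constraints exactly as stated (side range, divisibility by 32, pixel cap of 2^21); the function names are hypothetical. Taken literally, the pixel cap rejects 2048 x 2048 (2^22 pixels), so treat the cap as this README's wording rather than a verified limit.

```python
MIN_SIDE = 512
MAX_SIDE = 2048
MULTIPLE = 32
MAX_PIXELS = 2 ** 21  # 2,097,152 pixels, as stated in the README


def is_valid_resolution(width: int, height: int) -> bool:
    """Return True if both sides and the total pixel count satisfy the stated rule."""
    sides_ok = all(
        MIN_SIDE <= side <= MAX_SIDE and side % MULTIPLE == 0
        for side in (width, height)
    )
    return sides_ok and width * height <= MAX_PIXELS


def snap_side(value: int) -> int:
    """Clamp a side to [512, 2048] and round it down to a multiple of 32."""
    clamped = max(MIN_SIDE, min(MAX_SIDE, value))
    return clamped - clamped % MULTIPLE
```

`snap_side` only fixes a single side; the combined pixel count still needs `is_valid_resolution` afterwards.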

Memory usage measured with BF16 precision and batchsize=4 is shown in the table below:

| Resolution | enable_model_cpu_offload OFF | enable_model_cpu_offload ON | enable_model_cpu_offload ON, Text Encoder 4bit |
|---|---|---|---|
| 512 * 512 | 33GB | 20GB | 13GB |
| 1280 * 720 | 35GB | 20GB | 13GB |
| 1024 * 1024 | 35GB | 20GB | 13GB |
| 1920 * 1280 | 39GB | 20GB | 14GB |
| 2048 * 2048 | 43GB | 21GB | 14GB |
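The memory figures can be used to pick a loading mode for a given GPU. The sketch below copies the table into a lookup and adds a loader for the two non-quantized modes. It is written under assumptions: the mode labels and helper names are hypothetical, the Hugging Face repo id `THUDM/CogView4-6B` is not given on this page, and the loader presumes a `diffusers` release that ships `CogView4Pipeline`. The 4-bit text encoder mode needs a separate quantization setup that is not shown.

```python
from typing import Dict, List

# Memory in GB copied from the table above (BF16, batchsize=4).
MEMORY_GB: Dict[str, Dict[str, int]] = {
    "512x512": {"offload_off": 33, "offload_on": 20, "offload_on_te4bit": 13},
    "1280x720": {"offload_off": 35, "offload_on": 20, "offload_on_te4bit": 13},
    "1024x1024": {"offload_off": 35, "offload_on": 20, "offload_on_te4bit": 13},
    "1920x1280": {"offload_off": 39, "offload_on": 20, "offload_on_te4bit": 14},
    "2048x2048": {"offload_off": 43, "offload_on": 21, "offload_on_te4bit": 14},
}


def modes_that_fit(resolution: str, vram_gb: float) -> List[str]:
    """List the table's modes whose reported usage fits in vram_gb."""
    return [mode for mode, gb in MEMORY_GB[resolution].items() if gb <= vram_gb]


def load_pipeline(cpu_offload: bool = True, repo_id: str = "THUDM/CogView4-6B"):
    """Load the pipeline in BF16 (FP16 is unsupported per the README)."""
    import torch  # heavy imports are deferred until a pipeline is requested
    from diffusers import CogView4Pipeline

    pipe = CogView4Pipeline.from_pretrained(repo_id, torch_dtype=torch.bfloat16)
    if cpu_offload:
        pipe.enable_model_cpu_offload()  # trades speed for lower GPU memory
    else:
        pipe.to("cuda")
    return pipe
```

For example, `modes_that_fit("1024x1024", 24)` would exclude the no-offload mode, which the table reports at 35GB.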

Model Metrics

We’ve tested on multiple benchmarks and achieved the following scores:

DPG-Bench

| Model | Overall | Global | Entity | Attribute | Relation | Other |
|---|---|---|---|---|---|---|
| SDXL | 74.65 | 83.27 | 82.43 | 80.91 | 86.76 | 80.41 |
| PixArt-alpha | 71.11 | 74.97 | 79.32 | 78.60 | 82.57 | 76.96 |
| SD3-Medium | 84.08 | 87.90 | 91.01 | 88.83 | 80.70 | 88.68 |
| DALL-E 3 | 83.50 | 90.97 | 89.61 | 88.39 | 90.58 | 89.83 |
| Flux.1-dev | 83.79 | 85.80 | 86.79 | 89.98 | 90.04 | 89.90 |
| Janus-Pro-7B | 84.19 | 86.90 | 88.90 | 89.40 | 89.32 | 89.48 |
| CogView4-6B | 85.13 | 83.85 | 90.35 | 91.17 | 91.14 | 87.29 |

GenEval

| Model | Overall | Single Obj. | Two Obj. | Counting | Colors | Position | Color attribution |
|---|---|---|---|---|---|---|---|
| SDXL | 0.55 | 0.98 | 0.74 | 0.39 | 0.85 | 0.15 | 0.23 |
| PixArt-alpha | 0.48 | 0.98 | 0.50 | 0.44 | 0.80 | 0.08 | 0.07 |
| SD3-Medium | 0.74 | 0.99 | 0.94 | 0.72 | 0.89 | 0.33 | 0.60 |
| DALL-E 3 | 0.67 | 0.96 | 0.87 | 0.47 | 0.83 | 0.43 | 0.45 |
| Flux.1-dev | 0.66 | 0.98 | 0.79 | 0.73 | 0.77 | 0.22 | 0.45 |
| Janus-Pro-7B | 0.80 | 0.99 | 0.89 | 0.59 | 0.90 | 0.79 | 0.66 |
| CogView4-6B | 0.73 | 0.99 | 0.86 | 0.66 | 0.79 | 0.48 | 0.58 |

T2I-CompBench

| Model | Color | Shape | Texture | 2D-Spatial | 3D-Spatial | Numeracy | Non-spatial Clip | Complex 3-in-1 |
|---|---|---|---|---|---|---|---|---|
| SDXL | 0.5879 | 0.4687 | 0.5299 | 0.2133 | 0.3566 | 0.4988 | 0.3119 | 0.3237 |
| PixArt-alpha | 0.6690 | 0.4927 | 0.6477 | 0.2064 | 0.3901 | 0.5058 | 0.3197 | 0.3433 |
| SD3-Medium | 0.8132 | 0.5885 | 0.7334 | 0.3200 | 0.4084 | 0.6174 | 0.3140 | 0.3771 |
| DALL-E 3 | 0.7785 | 0.6205 | 0.7036 | 0.2865 | 0.3744 | 0.5880 | 0.3003 | 0.3773 |
| Flux.1-dev | 0.7572 | 0.5066 | 0.6300 | 0.2700 | 0.3992 | 0.6165 | 0.3065 | 0.3628 |
| Janus-Pro-7B | 0.5145 | 0.3323 | 0.4069 | 0.1566 | 0.2753 | 0.4406 | 0.3137 | 0.3806 |
| CogView4-6B | 0.7786 | 0.5880 | 0.6983 | 0.3075 | 0.3708 | 0.6626 | 0.3056 | 0.3869 |

Chinese Text Accuracy Evaluation

| Model | Precision | Recall | F1 Score | Pick@4 |
|---|---|---|---|---|
| Kolors | 0.6094 | 0.1886 | 0.2880 | 0.1633 |
| CogView4-6B | 0.6969 | 0.5532 | 0.6168 | 0.3265 |
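As a sanity check on the Chinese text accuracy figures, the F1 column can be recomputed from the precision and recall columns. This sketch assumes F1 is the plain harmonic mean of the reported precision and recall; the README does not say how the scores were aggregated, so agreement only to within rounding is the most that should be expected.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above.
REPORTED = {
    "Kolors": (0.6094, 0.1886, 0.2880),
    "CogView4-6B": (0.6969, 0.5532, 0.6168),
}

for model, (precision, recall, reported_f1) in REPORTED.items():
    recomputed = f1_score(precision, recall)
    print(f"{model}: recomputed F1 = {recomputed:.4f}, reported F1 = {reported_f1:.4f}")
```

The gap between the two models comes mostly from recall: precision is similar, but CogView4-6B's reported recall is roughly three times that of Kolors.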

Citation

🌟 If you find our work helpful, please consider citing our paper and leaving a star.

@article{zheng2024cogview3,
  title={Cogview3: Finer and faster text-to-image generation via relay diffusion},
  author={Zheng, Wendi and Teng, Jiayan and Yang, Zhuoyi and Wang, Weihan and Chen, Jidong and Gu, Xiaotao and Dong, Yuxiao and Ding, Ming and Tang, Jie},
  journal={arXiv preprint arXiv:2403.05121},
  year={2024}
}

License

This model is released under the Apache 2.0 License.