chenxwh / cogvlm2

CogVLM2: Visual Language Models for Image and Video Understanding


Input

  • input_image (file, required): Input image.
  • Input prompt (string). Default: "Describe this image."
  • Top-p (number, minimum: 0, maximum: 1): When decoding text, samples from the top p percentage of most likely tokens; lower it to ignore less likely tokens. Default: 0.9.
  • Temperature (number, minimum: 0): Adjusts randomness of outputs; values greater than 1 are more random, 0 is deterministic. Default: 0.7.
  • Max new tokens (integer, minimum: 0): Maximum number of tokens to generate. A word is generally 2-3 tokens. Default: 2048.
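
A minimal sketch of supplying these inputs through the Replicate Python client is shown below. The field names (input_image, prompt, top_p, temperature, max_new_tokens) are assumptions inferred from the form above rather than a confirmed schema, so verify them against the model's API documentation before use.

# Minimal sketch: calling this model via the Replicate Python client.
# Assumes `pip install replicate` and REPLICATE_API_TOKEN set in the environment.
# The input field names are inferred from the form above and may not match
# the model's actual schema exactly.
import replicate

with open("library.jpg", "rb") as image_file:
    output = replicate.run(
        "chenxwh/cogvlm2",  # optionally pin a specific version id
        input={
            "input_image": image_file,         # required image input
            "prompt": "Describe this image.",  # default prompt
            "top_p": 0.9,                      # nucleus sampling threshold
            "temperature": 0.7,                # output randomness
            "max_new_tokens": 2048,            # cap on generated tokens
        },
    )

print(output)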

Output

The image captures a well-lit, modern library or bookstore with a distinct industrial aesthetic. The main focus is a large, wooden bookshelf filled with an assortment of books, creating a warm and inviting atmosphere. The bookshelf is positioned against a rustic brick wall, which adds a touch of vintage charm to the space. The room is illuminated by hanging light bulbs, which dangle from the ceiling in a casual manner. There are also decorative elements such as a potted plant, a small framed sign, and a table with various items on it, enhancing the cozy ambiance. A person is seated at a desk in the foreground, suggesting the space is functional for reading or studying.

Run time and cost

This model costs approximately $0.052 to run on Replicate, or 19 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.

This model runs on Nvidia L40S GPU hardware. Predictions typically complete within 54 seconds. The predict time for this model varies significantly based on the inputs.
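
For local runs with Docker, a Cog container exposes an HTTP prediction endpoint. The sketch below assumes the container is started with the default port mapping (for example, docker run -p 5000:5000 <image>) and reuses the same assumed input field names as above; it is an illustration, not the documented interface for this particular image.

# Rough sketch: querying a locally running Cog container over HTTP.
# Assumes the container was started with `docker run -p 5000:5000 <image>`
# and that the input field names match the form shown earlier.
import base64
import requests

with open("library.jpg", "rb") as f:
    data_uri = "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

response = requests.post(
    "http://localhost:5000/predictions",
    json={
        "input": {
            "input_image": data_uri,  # file inputs are passed as data URIs or URLs
            "prompt": "Describe this image.",
            "top_p": 0.9,
            "temperature": 0.7,
            "max_new_tokens": 2048,
        }
    },
)
response.raise_for_status()
print(response.json()["output"])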

Readme

CogVLM2

Model introduction

We launch a new generation of the CogVLM2 series and open-source two models based on Meta-Llama-3-8B-Instruct. Compared with the previous generation of open-source CogVLM models, the CogVLM2 series brings the following improvements:

  1. Significant improvements on many benchmarks such as TextVQA and DocVQA.
  2. Support for 8K content length.
  3. Support for image resolutions up to 1344 × 1344.
  4. An open-source model version that supports both Chinese and English.

Image Understanding

Compared with the previous generation of open-source CogVLM models, our open-source models achieve strong results on many benchmarks, and their performance is competitive with some closed-source models.

License

This model is released under the CogVLM2 LICENSE. For models built with Meta Llama 3, please also adhere to the LLAMA3_LICENSE.

Citation

If you find our work helpful, please consider citing the following papers:

@article{hong2024cogvlm2,
  title={CogVLM2: Visual Language Models for Image and Video Understanding},
  author={Hong, Wenyi and Wang, Weihan and Ding, Ming and Yu, Wenmeng and Lv, Qingsong and Wang, Yan and Cheng, Yean and Huang, Shiyu and Ji, Junhui and Xue, Zhao and others},
  journal={arXiv preprint arXiv:2408.16500},
  year={2024}
}
@misc{wang2023cogvlm,
  title={CogVLM: Visual Expert for Pretrained Language Models},
  author={Weihan Wang and Qingsong Lv and Wenmeng Yu and Wenyi Hong and Ji Qi and Yan Wang and Junhui Ji and Zhuoyi Yang and Lei Zhao and Xixuan Song and Jiazheng Xu and Bin Xu and Juanzi Li and Yuxiao Dong and Ming Ding and Jie Tang},
  year={2023},
  eprint={2311.03079},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}