cjwbw / cogvlm

powerful open-source visual language model

  • Public
  • 533.9K runs
  • GitHub
  • Paper
  • License

Input

Output

Run time and cost

This model runs on Nvidia A40 (Large) GPU hardware. Predictions typically complete within 120 seconds. The predict time for this model varies significantly based on the inputs.

Readme

CogVLM

Introduction

  • CogVLM is a powerful open-source visual language model (VLM). CogVLM-17B has 10 billion vision parameters and 7 billion language parameters.

  • CogVLM-17B achieves state-of-the-art performance on 10 classic cross-modal benchmarks, including NoCaps, Flicker30k captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA and TDIUC, and rank the 2nd on VQAv2, OKVQA, TextVQA, COCO captioning, etc., surpassing or matching PaLI-X 55B. CogVLM can also chat with you about images.

Method LLM MM-VET POPE(adversarial) TouchStone
BLIP-2 Vicuna-13B 22.4 - -
Otter MPT-7B 24.7 - -
MiniGPT4 Vicuna-13B 24.4 70.4 531.7
InstructBLIP Vicuna-13B 25.6 77.3 552.4
LLaMA-Adapter v2 LLaMA-7B 31.4 - 590.1
LLaVA LLaMA2-7B 28.1 66.3 602.7
mPLUG-Owl LLaMA-7B - 66.8 605.4
LLaVA-1.5 Vicuna-13B 36.3 84.5 -
Emu LLaMA-13B 36.3 - -
Qwen-VL-Chat - - - 645.2
DreamLLM Vicuna-7B 35.9 76.5 -
CogVLM Vicuna-7B 52.8 87.6 742.0

Method

CogVLM model comprises four fundamental components: a vision transformer (ViT) encoder, an MLP adapter, a pretrained large language model (GPT), and a visual expert module. See Paper for more details.

License

The code in this repository is open source under the Apache-2.0 license, while the use of the CogVLM model weights must comply with the Model License.

Citation & Acknowledgements

If you find our work helpful, please consider citing the following papers

@article{wang2023cogvlm,
      title={CogVLM: Visual Expert for Pretrained Language Models}, 
      author={Weihan Wang and Qingsong Lv and Wenmeng Yu and Wenyi Hong and Ji Qi and Yan Wang and Junhui Ji and Zhuoyi Yang and Lei Zhao and Xixuan Song and Jiazheng Xu and Bin Xu and Juanzi Li and Yuxiao Dong and Ming Ding and Jie Tang},
      year={2023},
      eprint={2311.03079},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

In the instruction fine-tuning phase of the CogVLM, there are some English image-text data from the MiniGPT-4, LLAVA, LRV-Instruction, LLaVAR and Shikra projects, as well as many classic cross-modal work datasets. We sincerely thank them for their contributions.