pku-yuangroup / llava-cot

Let Vision Language Models Reason Step-by-Step

Run with an API
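
The model can be called programmatically via the Replicate API. Below is a minimal sketch using the Replicate Python client; the input field names (`image`, `prompt`) and running against the latest model version are assumptions, so check the model's API schema on Replicate for the exact parameters and version hash.

```python
import replicate  # requires REPLICATE_API_TOKEN to be set in the environment

# Hypothetical input fields; consult the model's API schema for the real names.
output = replicate.run(
    "pku-yuangroup/llava-cot",
    input={
        "image": open("example.jpg", "rb"),
        "prompt": "How many objects are in this image? Reason step by step.",
    },
)
print(output)
```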

Run time and cost

This model runs on Nvidia L40S GPU hardware. We don't yet have enough runs of this model to provide performance information.

Readme

🔥 Highlights

LLaVA-CoT is the first vision-language model capable of spontaneous, systematic reasoning, similar to GPT-o1!

Our 11B model outperforms Gemini-1.5-Pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct on six challenging multimodal benchmarks.

📝 Citation

If you find this paper useful, please consider starring 🌟 this repo and citing 📑 our paper:

```bibtex
@misc{xu2024llavacot,
      title={LLaVA-CoT: Let Vision Language Models Reason Step-by-Step},
      author={Guowei Xu and Peng Jin and Hao Li and Yibing Song and Lichao Sun and Li Yuan},
      year={2024},
      eprint={2411.10440},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.10440},
}
```

🙏 Acknowledgement

  • The majority of this project is released under the Apache 2.0 license as found in the LICENSE file.
  • The service is a research preview intended for non-commercial use only, subject to the LLAMA 3.2 COMMUNITY LICENSE AGREEMENT and the OpenAI Terms of Use for generated data. Please contact us if you find any potential violations.
  • The template is modified from Chat-UniVi and LLaVA.