chenxwh / llava-cot

Let Vision Language Models Reason Step-by-Step


Run time and cost

This model runs on Nvidia L40S GPU hardware. We don't yet have enough runs of this model to provide performance information.
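A minimal sketch of invoking the hosted model with the Replicate Python client is shown below (it assumes the REPLICATE_API_TOKEN environment variable is set). The input field names ("image", "prompt") and the file path are illustrative assumptions; check the model's API schema on Replicate for the exact parameters.

import replicate

# Run the hosted model; replicate.run() blocks until the prediction finishes.
output = replicate.run(
    "chenxwh/llava-cot",
    input={
        "image": open("example.jpg", "rb"),  # hypothetical local image file
        "prompt": "What is shown in this image? Reason step by step.",
    },
)
print(output)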

Readme

🔥 Highlights

LLaVA-CoT is the first vision language model capable of spontaneous, systematic reasoning, similar to GPT-o1!

Our 11B model outperforms Gemini-1.5-Pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct on six challenging multimodal benchmarks.

📝 Citation

If you find this paper useful, please consider starring 🌟 this repo and citing 📑 our paper:

@misc{xu2024llavacot,
      title={LLaVA-CoT: Let Vision Language Models Reason Step-by-Step}, 
      author={Guowei Xu and Peng Jin and Hao Li and Yibing Song and Lichao Sun and Li Yuan},
      year={2024},
      eprint={2411.10440},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.10440}, 
}

🙏 Acknowledgement

  • The majority of this project is released under the Apache 2.0 license as found in the LICENSE file.
  • The service is a research preview intended for non-commercial use only, subject to the LLAMA 3.2 COMMUNITY LICENSE AGREEMENT and the Terms of Use for data generated by OpenAI. Please contact us if you find any potential violations.
  • The template is modified from Chat-UniVi and LLaVA.