Check out the different LLaVA models on Replicate (a minimal client sketch follows the table):
| Name | Version | Base | Size | Finetunable |
|---|---|---|---|---|
| v1.5 - Vicuna-13B | v1.5 | Vicuna | 13B | Yes |
| v1.6 - Vicuna-13B | v1.6 | Vicuna | 13B | No |
| v1.6 - Vicuna-7B | v1.6 | Vicuna | 7B | No |
| v1.6 - Mistral-7B | v1.6 | Mistral | 7B | No |
| v1.6 - Nous-Hermes-2-34B | v1.6 | Nous-Hermes-2 | 34B | No |
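
To try one of these models from Python, the Replicate client can run it directly (with `REPLICATE_API_TOKEN` set). This is a minimal sketch; the model slug and the input field names (`image`, `prompt`) are assumptions to verify against the model's API page on Replicate:

```python
# Minimal sketch: calling a hosted LLaVA model through the Replicate Python client.
# Assumptions: the model slug and the input field names ("image", "prompt") --
# check the model's API tab on Replicate for the exact schema and version hash.
import replicate

output = replicate.run(
    "yorickvp/llava-13b",  # hypothetical slug; pin an explicit version in real use
    input={
        "image": open("view.jpg", "rb"),   # local image file, uploaded by the client
        "prompt": "What is unusual about this image?",
    },
)

# The model streams its answer as text chunks; join them into one string.
print("".join(output))
```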
# 🌋 LLaVA v1.6: Large Language and Vision Assistant
Visual instruction tuning towards large language and vision models with GPT-4 level capabilities.
[Project Page] [Demo] [Data] [Model Zoo]
Improved Baselines with Visual Instruction Tuning [Paper]
Haotian Liu, Chunyuan Li, Yuheng Li, Yong Jae Lee
Visual Instruction Tuning (NeurIPS 2023, Oral) [Paper]
Haotian Liu*, Chunyuan Li*, Qingyang Wu, Yong Jae Lee (*Equal Contribution)
## LLaVA v1.6 changes
LLaVA-1.6 is out! Building on LLaVA-1.5 with additional scaling, LLaVA-1.6-34B outperforms Gemini Pro on some benchmarks. It can now process 4x more pixels and handle more tasks and applications than before. Check out the blog post!
## Summary
LLaVA is a novel end-to-end trained large multimodal model that combines a vision encoder with Vicuna for general-purpose visual and language understanding. It achieves impressive chat capabilities in the spirit of the multimodal GPT-4 and sets a new state-of-the-art accuracy on Science QA.
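
As a concrete illustration of that architecture (vision encoder feeding visual tokens into a Vicuna-based language model), the sketch below runs a LLaVA-1.5 checkpoint through the community `llava-hf` port in Hugging Face `transformers` rather than this repository's own serving code; the model id, prompt template, and generation settings are assumptions to check against the Model Zoo:

```python
# Sketch of LLaVA inference via the community llava-hf port in Hugging Face
# transformers (an assumption -- this repo also ships its own serving code).
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint id
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"  # needs `accelerate`
)

image = Image.open("view.jpg")
prompt = "USER: <image>\nWhat is unusual about this image? ASSISTANT:"

# The processor turns the image into vision-encoder inputs and the text into
# language-model tokens; the model fuses both and generates the answer.
inputs = processor(text=prompt, images=image, return_tensors="pt").to(
    model.device, torch.float16
)
out = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(out[0], skip_special_tokens=True))
```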