
Check out the different LLaVA models on Replicate:

Name	Version	Base	Size	Finetunable
v1.5 - Vicuna-13B	v1.5	Vicuna	13B	Yes
v1.6 - Vicuna-13B	v1.6	Vicuna	13B	No
v1.6 - Vicuna-7B	v1.6	Vicuna	7B	No
v1.6 - Mistral-7B	v1.6	Mistral	7B	No
v1.6 - Nous-Hermes-2-34B	v1.6	Nous-Hermes-2	34B	No

🌋 LLaVA v1.6: Large Language and Vision Assistant

Visual instruction tuning towards large language and vision models with GPT-4 level capabilities.

[Project Page] [Demo] [Data] [Model Zoo]

Improved Baselines with Visual Instruction Tuning [Paper]
Haotian Liu, Chunyuan Li, Yuheng Li, Yong Jae Lee

Visual Instruction Tuning (NeurIPS 2023, Oral) [Paper]
Haotian Liu*, Chunyuan Li*, Qingyang Wu, Yong Jae Lee (*Equal Contribution)

LLaVA v1.6 changes

LLaVA-1.6 is out! With additional scaling to LLaVA-1.5, LLaVA-1.6-34B outperforms Gemini Pro on some benchmarks. It can now process 4x more pixels and perform more tasks/applications than before. Check out the blog post!
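To make the "4x more pixels" figure concrete, here is a minimal arithmetic sketch. The resolutions are assumptions that do not appear in this README: 336x336 input for LLaVA-1.5, and up to 672x672, 336x1344, or 1344x336 for LLaVA-1.6. The helper names are hypothetical and are not part of any LLaVA or Replicate API.

```python
# Assumed input resolutions (not stated in this README):
# LLaVA-1.5 takes 336x336 images; LLaVA-1.6 supports up to
# 672x672, 336x1344, or 1344x336.
BASE_RESOLUTION = (336, 336)
V16_RESOLUTIONS = [(672, 672), (336, 1344), (1344, 336)]


def pixel_count(size):
    """Return the number of pixels in a (width, height) image."""
    width, height = size
    return width * height


def pixel_ratio(size, base=BASE_RESOLUTION):
    """Return how many times more pixels `size` has than `base`."""
    return pixel_count(size) / pixel_count(base)


# Ratio of each assumed v1.6 resolution to the assumed v1.5 resolution.
ratios = {size: pixel_ratio(size) for size in V16_RESOLUTIONS}

if __name__ == "__main__":
    for (width, height), ratio in ratios.items():
        print(f"{width}x{height}: {ratio:.0f}x the pixels of 336x336")
```

Under these assumptions, each of the three v1.6 resolutions holds four times as many pixels as a 336x336 input, regardless of aspect ratio.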

Summary

LLaVA is a novel end-to-end trained large multimodal model that combines a vision encoder with Vicuna for general-purpose visual and language understanding. It achieves impressive chat capabilities in the spirit of the multimodal GPT-4 and sets a new state-of-the-art accuracy on Science QA.
