yorickvp / llava-v1.6-34b

LLaVA v1.6: Large Language and Vision Assistant (Nous-Hermes-2-34B)

  • Public
  • 1.1M runs

Run time and cost

This model runs on Nvidia A100 (80GB) GPU hardware. Predictions typically complete within 24 seconds.
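As a rough illustration of requesting a prediction, the sketch below uses the `replicate` Python client. The input field names `image` and `prompt`, and the helper names `build_input` and `describe_image`, are assumptions for illustration and are not confirmed by this page; the model reference is taken from the page title, and some client versions may require a `:<version>` suffix that is not given here.

```python
from typing import Any, Dict

# Model reference taken from the page title. A ":<version>" suffix may be
# required by some client versions; none is listed on this page.
MODEL_REF = "yorickvp/llava-v1.6-34b"


def build_input(image_url: str, prompt: str) -> Dict[str, Any]:
    """Assemble the prediction input.

    The field names "image" and "prompt" are assumptions, not taken from
    this page.
    """
    return {"image": image_url, "prompt": prompt}


def describe_image(image_url: str, prompt: str) -> str:
    """Run one prediction and return the generated text as a single string."""
    # Imported lazily: needs `pip install replicate` and the
    # REPLICATE_API_TOKEN environment variable to be set.
    import replicate

    output = replicate.run(MODEL_REF, input=build_input(image_url, prompt))
    # Assumed: the output is either a string or an iterable of text chunks.
    return output if isinstance(output, str) else "".join(output)
```

Calling `describe_image("https://example.com/photo.jpg", "What is in this image?")` would then block until the prediction finishes, which by the figure above is typically within 24 seconds.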

Readme

Check out the different LLaVA models on Replicate:

Name                       Version  Base           Size  Finetunable
v1.5 - Vicuna-13B          v1.5     Vicuna         13B   Yes
v1.6 - Vicuna-13B          v1.6     Vicuna         13B   No
v1.6 - Vicuna-7B           v1.6     Vicuna         7B    No
v1.6 - Mistral-7B          v1.6     Mistral        7B    No
v1.6 - Nous-Hermes-2-34B   v1.6     Nous-Hermes-2  34B   No

🌋 LLaVA v1.6: Large Language and Vision Assistant

Visual instruction tuning towards large language and vision models with GPT-4 level capabilities.

[Project Page] [Demo] [Data] [Model Zoo]

Improved Baselines with Visual Instruction Tuning [Paper]
Haotian Liu, Chunyuan Li, Yuheng Li, Yong Jae Lee

Visual Instruction Tuning (NeurIPS 2023, Oral) [Paper]
Haotian Liu*, Chunyuan Li*, Qingyang Wu, Yong Jae Lee (*Equal Contribution)

LLaVA v1.6 changes

LLaVA-1.6 is out! Building on LLaVA-1.5 with additional scaling, LLaVA-1.6-34B outperforms Gemini Pro on some benchmarks. It can now process 4x more pixels and handle more tasks and applications than before. Check out the blog post!
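To make the "4x more pixels" figure concrete, here is a small arithmetic sketch. It assumes the resolutions reported in the LLaVA-1.6 blog post, which are not stated on this page: a 336x336 input for LLaVA-1.5, and up to 672x672, 336x1344, or 1344x336 for LLaVA-1.6. The helper name `pixel_ratio` is hypothetical.

```python
# Assumed from the LLaVA-1.6 blog post, not from this page:
BASE_SIDE = 336  # LLaVA-1.5 input side length in pixels
V16_RESOLUTIONS = [(672, 672), (336, 1344), (1344, 336)]  # (width, height)


def pixel_ratio(width: int, height: int, base_side: int = BASE_SIDE) -> float:
    """Return how many times more pixels a width x height input has
    than a base_side x base_side input."""
    return (width * height) / (base_side * base_side)


for width, height in V16_RESOLUTIONS:
    ratio = pixel_ratio(width, height)
    print(f"{width}x{height}: {ratio:.0f}x the pixels of {BASE_SIDE}x{BASE_SIDE}")
```

Under these assumptions, each of the three supported shapes holds 451,584 pixels against 112,896 for the 336x336 baseline, which is where the factor of four comes from.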

Summary

LLaVA is an end-to-end trained large multimodal model that connects a vision encoder to Vicuna for general-purpose visual and language understanding. It achieves impressive chat capabilities in the spirit of the multimodal GPT-4 and sets a new state-of-the-art accuracy on Science QA.