lucataco / llava-phi-3-mini

llava-phi-3-mini is a LLaVA model fine-tuned from microsoft/Phi-3-mini-4k-instruct

  • Public
  • 2.6K runs
  • GitHub
  • License

Input

Output

Run time and cost

This model runs on Nvidia A40 GPU hardware. Predictions typically complete within 6 seconds.

Readme

llava-phi-3-mini

llava-phi-3-mini is a LLaVA model fine-tuned from microsoft/Phi-3-mini-4k-instruct and CLIP-ViT-Large-patch14-336 with ShareGPT4V-PT and InternVL-SFT by XTuner.

Note: This model is in HuggingFace LLaVA format.

Resources:

Details

Model Visual Encoder Projector Resolution Pretraining Strategy Fine-tuning Strategy Pretrain Dataset Fine-tune Dataset Pretrain Epoch Fine-tune Epoch
LLaVA-v1.5-7B CLIP-L MLP 336 Frozen LLM, Frozen ViT Full LLM, Frozen ViT LLaVA-PT (558K) LLaVA-Mix (665K) 1 1
LLaVA-Llama-3-8B CLIP-L MLP 336 Frozen LLM, Frozen ViT Full LLM, LoRA ViT LLaVA-PT (558K) LLaVA-Mix (665K) 1 1
LLaVA-Llama-3-8B-v1.1 CLIP-L MLP 336 Frozen LLM, Frozen ViT Full LLM, LoRA ViT ShareGPT4V-PT (1246K) InternVL-SFT (1268K) 1 1
LLaVA-Phi-3-mini CLIP-L MLP 336 Frozen LLM, Frozen ViT Full LLM, Full ViT ShareGPT4V-PT (1246K) InternVL-SFT (1268K) 1 2

Results

<div align="center"> Image </div>
Model MMBench Test (EN) MMMU Val SEED-IMG AI2D Test ScienceQA Test HallusionBench aAcc POPE GQA TextVQA MME MMStar
LLaVA-v1.5-7B 66.5 35.3 60.5 54.8 70.4 44.9 85.9 62.0 58.2 1511/348 30.3
LLaVA-Llama-3-8B 68.9 36.8 69.8 60.9 73.3 47.3 87.2 63.5 58.0 1506/295 38.2
LLaVA-Llama-3-8B-v1.1 72.3 37.1 70.1 70.0 72.9 47.7 86.4 62.6 59.0 1469/349 45.1
LLaVA-Phi-3-mini 69.2 41.4 70.0 69.3 73.7 49.8 87.3 61.5 57.8 1477/313 43.7

Reproduce

Please refer to docs.

Citation

@misc{2023xtuner,
    title={XTuner: A Toolkit for Efficiently Fine-tuning LLM},
    author={XTuner Contributors},
    howpublished = {\url{https://github.com/InternLM/xtuner}},
    year={2023}
}