Readme
Model description
BLIP3 is a series of foundational Large Multimodal Models (LMMs) developed by Salesforce AI Research.
These models have been trained at scale on high-quality image caption datasets and interleaved image-text data. Key features of BLIP3:
- The pretrained foundation model, blip3-phi3-mini-base-r-v1, achieves state-of-the-art performance among models under 5B parameters and demonstrates strong in-context learning capabilities.
- The instruct fine-tuned model, blip3-phi3-mini-instruct-r-v1, achieves state-of-the-art performance among open-source and closed-source VLMs under 5B parameters.
- blip3-phi3-mini-instruct-r-v1 supports flexible high-resolution image encoding with efficient visual token sampling.
More technical details will come with a technical report soon.
Datasets
Dataset Type | Dataset(s) Used |
---|---|
Pretrain | Caption data (datacomp, cc12m, cc3m, SBU, vg) and interleaved data (obelics) |
Instruction Tuning | LLaVA-Instruct-150K, ShareGPT4V captions, a mixture of academic VQA data including OCR/Document/Chart-focused tasks, publicly available text-only instruction data |
Results
Pretrain
Model | Shot | COCO (val) | NoCaps (val) | TextCaps (val) | OKVQA (val) | TextVQA (val) | VizWiz (testdev) | VQAv2 (testdev) |
---|---|---|---|---|---|---|---|---|
Flamingo-3B | 4 | 85.0 | - | - | 43.3 | 32.7 | 34 | 53.2 |
Flamingo-3B | 8 | 90.6 | - | - | 44.6 | 32.4 | 38.4 | 55.4 |
MM1-3B | 0 | 73.5 | 55.6 | 63.3 | 26.1 | 29.4 | 15.6 | 46.2 |
MM1-3B | 4 | 112.3 | 99.7 | 84.1 | 48.6 | 45.3 | 38.0 | 57.9 |
MM1-3B | 8 | 114.6 | 104.7 | 88.8 | 48.4 | 44.6 | 46.4 | 63.6 |
blip3-phi3-mini-base-r-v1 (Ours) | 0 | 81.7 | 80.2 | 60.7 | 26.5 | 36.0 | 21.2 | 48.1 |
blip3-phi3-mini-base-r-v1 (Ours) | 4 | 110.5 | 101.7 | 84.6 | 49.2 | 46.1 | 38.4 | 63.9 |
blip3-phi3-mini-base-r-v1 (Ours) | 8 | 112.1 | 104.4 | 87.7 | 49.1 | 46.4 | 44.3 | 63.8 |
Instruct
Model | SEED-IMG | MMBench(dev) | MME-total | MME-P | MME-C | MMStar | MMMU (val) | MMVet | MathVista (mini) | ScienceQA (test) | POPE | AI2D |
---|---|---|---|---|---|---|---|---|---|---|---|---|
MM1-3B-Chat | 68.8 | 75.9 | 1761 | 1482 | 279 | - | 33.9 | 43.7 | - | - | 87.4 | - |
openbmb/MiniCPM-V-2 | 67.1 | 69.6 | 1808 | - | - | - | 38.2 | - | 38.7 | - | - | - |
VILA1.5-3B | 67.9 | 63.4 | - | 1442 | - | - | 33.3 | 35.4 | - | 69.0 | 85.9 | - |
xtuner/llava-phi-3-mini-hf | 70.0 | 69.2 | 1790 | 1477 | 313 | 43.7 | 41.4 | - | - | 73.7 | 87.3 | 69.3 |
blip3-phi3-mini-instruct-r-v1 (Ours) | 72.1 | 74.1 | 1827 | 1467 | 360 | 44.6 | 39.8 | 45.1 | 39.3 | 74.2 | 87.2 | 75.8 |
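The three MME columns in the Instruct table are related: MME-total is the sum of the perception score (MME-P) and the cognition score (MME-C). The short check below is only a sketch for readers who want to verify that relationship on the rows that report all three values; the dictionary and the function name `mme_total` are illustrative and not part of any released code.

```python
# (MME-P, MME-C, MME-total) copied from the rows of the Instruct table
# that report all three values.
reported = {
    "MM1-3B-Chat": (1482, 279, 1761),
    "xtuner/llava-phi-3-mini-hf": (1477, 313, 1790),
    "blip3-phi3-mini-instruct-r-v1": (1467, 360, 1827),
}


def mme_total(perception: int, cognition: int) -> int:
    """MME-total is the perception score plus the cognition score."""
    return perception + cognition


# Collect any row whose reported total disagrees with P + C.
mismatches = {
    name: (p, c, total)
    for name, (p, c, total) in reported.items()
    if mme_total(p, c) != total
}

print("inconsistent rows:", mismatches)
```

Rows that omit MME-P or MME-C (shown as `-` in the table) are left out because the sum cannot be checked for them.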
More comprehensive examples can be found in the notebook.
Reproducibility:
Our SFT evaluation is based on VLMEvalKit, in which we fixed some inconsistencies with the official benchmarks (e.g., the LLM judge API). During development, we noticed that the raw resolution of the input image can noticeably affect the model output in some cases.
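Because input resolution can change the output, one way to make comparisons repeatable is to fix a resizing rule and record the resulting size alongside each result. The helper below is a minimal sketch of such a rule, not the preprocessing used for the numbers above: the function name `fit_within` and the default limit of 768 pixels are assumptions chosen only for illustration.

```python
def fit_within(width: int, height: int, max_side: int = 768) -> tuple[int, int]:
    """Return (width, height) scaled so the longer side is at most max_side.

    The aspect ratio is preserved and images already within the limit are
    returned unchanged. max_side=768 is an illustrative value, not a setting
    taken from the model card.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    longer = max(width, height)
    if longer <= max_side:
        return width, height
    scale = max_side / longer
    # round() can produce 0 for extreme aspect ratios, so keep at least 1 pixel.
    return max(1, round(width * scale)), max(1, round(height * scale))


# Example: a 1536x1024 image is halved, a 640x480 image is left as-is.
print(fit_within(1536, 1024))
print(fit_within(640, 480))
```

The returned size can then be passed to whatever image library performs the actual resize, and logged with the evaluation output so that a later run can use the same input resolution.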
Bias, Risks, Limitations, and Ethical Considerations
The main data sources are drawn from the internet, including webpages, image stock sites, and curated datasets released by the research community. We have excluded certain data, such as LAION, due to known CSAM concerns. The model may be subject to bias from the original data sources, as well as bias from LLMs and commercial APIs. We strongly recommend that users assess safety and fairness before applying the model to downstream applications.
License
Our code and weights are released under the Creative Commons Attribution-NonCommercial 4.0 (CC BY-NC 4.0) license. Please fill out a form here to inquire about commercial use of the model weights.
Code acknowledgement
- LAVIS
- openflamingo
- VLMEvalKit
Citation
@misc{blip3_phi3_mini,
title={BLIP3-phi3-mini-instruct Model Card},
url={https://huggingface.co/Salesforce/blip3-phi3-mini-instruct-r-v1},
author={Salesforce AI Research},
month={May},
year={2024}
}