zsxkib / blip-3

BLIP-3 / XGen-MM: answers questions about images ({blip3,xgen-mm}-phi3-mini-base-r-v1)


Run time and cost

This model runs on Nvidia A100 (40GB) GPU hardware.



Unofficial BLIP-3 (xgen-mm-phi3-mini-instruct-r-v1) demo and API

Note that this is an unofficial implementation of BLIP-3 (previously known as blip3-phi3-mini-base-r-v1) that is not associated with Salesforce.


BLIP-3 is a model that answers questions about images. To use it, provide an image, and then ask a question about that image. For example, you can provide the following image:

[Image: Marina Bay Sands]

and then pose the following question:

What is this a picture of?

and get the output:

Marina Bay Sands, Singapore.

BLIP-3 can also caption images. Captioning works by sending the model a blank prompt, and both the UI and the API expose an explicit toggle for it.
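
The two modes described above (asking a question and captioning) can be sketched as a small helper that builds a request payload. This is only an illustration: the field names `image`, `question`, and `caption` and the helper `build_blip3_input` are assumptions made for the example, not the confirmed input schema of this model, so check the API tab for the real parameter names.

```python
from typing import Optional


def build_blip3_input(image_url: str,
                      question: Optional[str] = None,
                      caption: bool = False) -> dict:
    """Build a request payload for an image Q&A or captioning call.

    The keys "image", "question", and "caption" are hypothetical names
    chosen for this sketch; the real API may use different ones.
    """
    if caption:
        # Captioning corresponds to a blank prompt plus the explicit toggle.
        return {"image": image_url, "question": "", "caption": True}
    if question is None or not question.strip():
        raise ValueError("a question is required unless caption=True")
    return {"image": image_url, "question": question.strip(), "caption": False}


# Question answering: an image plus a question about it.
qa_payload = build_blip3_input(
    "https://example.com/marina-bay-sands.jpg",  # placeholder URL
    "What is this a picture of?",
)

# Captioning: no question, toggle switched on.
caption_payload = build_blip3_input(
    "https://example.com/marina-bay-sands.jpg",  # placeholder URL
    caption=True,
)
```

Keeping the blank-prompt rule inside the helper means callers only choose a mode and never have to remember that captioning is expressed as an empty question.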

You can also provide BLIP-3 with more context when asking a question. For example, given the following image:


you can pass a previous question-and-answer pair as context, written in "question: … answer: …" format, like so:

question: What animal is this? answer: A panda

and then pose an additional question:

What country is this animal native to?

and get the output:

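The context format shown above can be produced from earlier question-and-answer pairs with a short helper. This is a minimal sketch: the function name `format_context` is hypothetical, and joining several pairs with a single space is an assumption, since the text above only shows one pair.

```python
from typing import Iterable, Tuple


def format_context(history: Iterable[Tuple[str, str]]) -> str:
    """Render earlier (question, answer) pairs as a context string.

    Each pair becomes "question: <q> answer: <a>". Joining multiple
    pairs with one space is an assumption made for this sketch.
    """
    parts = [
        f"question: {question.strip()} answer: {answer.strip()}"
        for question, answer in history
    ]
    return " ".join(parts)


context = format_context([("What animal is this?", "A panda")])
print(context)
```

The resulting string would be sent as the context alongside the follow-up question, for example "What country is this animal native to?".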

Model description

XGen-MM (previously known as BLIP-3) is a series of foundational Large Multimodal Models (LMMs) developed by Salesforce AI Research. The series builds on the designs of the earlier BLIP models, with fundamental enhancements intended to provide a more robust foundation.

Key features of XGen-MM:

- The pretrained foundation model, xgen-mm-phi3-mini-base-r-v1, achieves state-of-the-art performance under 5b parameters and demonstrates strong in-context learning capabilities.
- The instruct fine-tuned model, xgen-mm-phi3-mini-instruct-r-v1, achieves state-of-the-art performance among open-source and closed-source VLMs under 5b parameters.
- xgen-mm-phi3-mini-instruct-r-v1 supports flexible high-resolution image encoding with efficient visual token sampling.

These models have been trained at scale on high-quality image caption datasets and interleaved image-text data.


@misc{xgen_mm_phi3_mini,
  title={xgen-mm-phi3-mini-instruct Model Card},
  url={https://huggingface.co/Salesforce/xgen-mm-phi3-mini-instruct-r-v1},
  author={Salesforce AI Research},
  month={May},
  year={2024}
}