andreasjansson/blip-2

Answers questions about images

Run time and cost

This model runs on Nvidia A100 (40GB) GPU hardware. Predictions typically complete within 2 seconds.

Readme

Unofficial BLIP-2 demo and API

Note that this is an unofficial implementation of BLIP-2 that is not associated with Salesforce.

Usage

BLIP-2 is a model that answers questions about images. To use it, provide an image and ask a question about it. For example, you can provide the following image:

[Example image: Marina Bay Sands, Singapore]

and then pose the following question:

What is this a picture of?

and get the output:

marina bay sands, singapore.
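
For programmatic access, the same question can be asked through the Replicate Python client. The sketch below is illustrative: the input names `image` and `question` mirror the example above but are assumptions, and in real use you would pin a specific version hash; check the model's API tab for the exact schema.

```python
# Minimal sketch using the Replicate Python client (pip install replicate).
# Assumes REPLICATE_API_TOKEN is set in the environment. The input names
# "image" and "question" follow the example above; verify them against
# the model's API schema before relying on this.
import replicate

output = replicate.run(
    "andreasjansson/blip-2",  # pin a specific version hash in real use
    input={
        "image": open("marina-bay-sands.jpg", "rb"),
        "question": "What is this a picture of?",
    },
)
print(output)  # e.g. "marina bay sands, singapore"
```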

BLIP-2 can also caption images. Under the hood this works by sending the model a blank prompt, but the UI and API expose an explicit toggle for image captioning.
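
Assuming that toggle maps to a boolean `caption` input (an assumption; confirm against the API schema), a captioning call might look like:

```python
# Hypothetical sketch: assumes the captioning toggle is exposed as a
# boolean "caption" input; verify against the model's API schema.
import replicate

caption = replicate.run(
    "andreasjansson/blip-2",
    input={
        "image": open("marina-bay-sands.jpg", "rb"),
        "caption": True,  # caption the image instead of answering a question
    },
)
print(caption)
```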

You can also give BLIP-2 more context when asking a question. For example, given the following image:

[Example image: a panda]

you can provide the output of a previous Q&A as context, in `question: ... answer: ...` format, like so:

question: what animal is this? answer: panda

and then pose an additional question:

what country is this animal from?

and get the output:

china
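
In API terms, the prior Q&A plausibly goes into a `context` input alongside the new `question`. Both input names are assumptions based on the format above; confirm them in the model's API schema.

```python
# Sketch of a follow-up question with the prior Q&A supplied as context.
# The "context" input name is an assumption based on the format shown above.
import replicate

answer = replicate.run(
    "andreasjansson/blip-2",
    input={
        "image": open("panda.jpg", "rb"),
        "context": "question: what animal is this? answer: panda",
        "question": "what country is this animal from?",
    },
)
print(answer)  # e.g. "china"
```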

Model description

BLIP-2 is a generic and efficient pre-training strategy that bootstraps vision-language pre-training from frozen pre-trained image encoders and frozen large language models (LLMs). BLIP-2 beats Flamingo on zero-shot VQAv2 (65.0 vs. 56.3) and establishes a new state of the art in zero-shot captioning (121.6 CIDEr on NoCaps vs. the previous best of 113.2). Equipped with powerful LLMs (e.g. OPT, FlanT5), BLIP-2 also unlocks new zero-shot instructed vision-to-language generation capabilities for a variety of applications. Learn more at the official repo.

Citation

@misc{https://doi.org/10.48550/arxiv.2301.12597,
  doi = {10.48550/ARXIV.2301.12597},
  url = {https://arxiv.org/abs/2301.12597},
  author = {Li, Junnan and Li, Dongxu and Savarese, Silvio and Hoi, Steven},
  keywords = {Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences},
  title = {BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models},
  publisher = {arXiv},
  year = {2023},
  copyright = {Creative Commons Attribution 4.0 International}
}