zsxkib / blip-3

BLIP-3 / XGen-MM: answers questions about images ({blip3,xgen-mm}-phi3-mini-base-r-v1)

  • Public
  • 1.2M runs
  • A100 (80GB)
  • GitHub
  • License

Input

image
file (required)

Input image

string

Question to ask about this image

Default: "What is shown in the image?"

string

Optional. Previous questions and answers to be used as context for answering the current question

boolean

Select if you want to generate image captions instead of asking questions

Default: false

string

System prompt

Default: "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions."

integer
(minimum: 1, maximum: 2048)

Maximum number of new tokens to generate

Default: 768

integer
(minimum: 1, maximum: 10)

Number of beams for beam search

Default: 1

number
(minimum: 0.5, maximum: 1)

Temperature for use with nucleus sampling

Default: 1

integer
(minimum: 1)

The number of highest probability vocabulary tokens to keep for top-k sampling

Default: 50

number
(minimum: 0, maximum: 1)

The cumulative probability threshold for top-p sampling

Default: 1

number
(minimum: 0)

The parameter for repetition penalty; 1.0 means no penalty

Default: 1

number
(minimum: 0)

The length penalty applied during beam search generation

Default: 1

boolean

Whether to use sampling; when false, decoding is deterministic (greedy or beam search)

Default: false
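
Assembled into a single request, these inputs might look like the Python sketch below. Only image is named on this page; every other key is a hypothetical placeholder chosen to match the descriptions above, so check the model's API schema for the real field names.

    # Hypothetical input payload illustrating the parameters listed above.
    # Only "image" is confirmed by this page; all other key names are
    # placeholders and may differ from the model's actual API schema.
    inputs = {
        "image": open("exam.jpg", "rb"),            # Input image (required)
        "question": "What is shown in the image?",  # question to ask about the image
        "context": "",                              # optional previous Q&A context
        "image_captioning": False,                  # caption instead of answering
        "system_prompt": "A chat between a curious user and an artificial "
                         "intelligence assistant. The assistant gives helpful, "
                         "detailed, and polite answers to the user's questions.",
        "max_new_tokens": 768,         # 1-2048
        "num_beams": 1,                # 1-10, beams for beam search
        "temperature": 1.0,            # 0.5-1, nucleus sampling temperature
        "top_k": 50,                   # >= 1, top-k sampling cutoff
        "top_p": 1.0,                  # 0-1, cumulative probability threshold
        "repetition_penalty": 1.0,     # >= 0, 1.0 means no penalty
        "length_penalty": 1.0,         # >= 0, beam search length penalty
        "use_nucleus_sampling": False, # whether to sample at all
    }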

Output

The meme is a humorous representation of the varying levels of effort put into different sections of a handwritten exam. The top section, labeled "First two pages", shows a neatly written page with a clear and legible handwriting. The middle section, labeled "Middle pages", shows a page with messy and illegible handwriting, suggesting that the student may have rushed through these pages. The bottom section, labeled "Last two pages", shows a page with a heartbeat graph, indicating that the student may have been in a state of panic or stress during the exam, leading to a hurried and messy handwriting. The meme is a light-hearted way to comment on the common experience of students who may not put in the same level of effort throughout an exam.

This output was created using a different version of the model, zsxkib/blip-3:72044dfa.

Run time and cost

This model costs approximately $0.0024 to run on Replicate, or 416 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.

This model runs on Nvidia A100 (80GB) GPU hardware. Predictions typically complete within 2 seconds.

Readme

Unofficial BLIP-3 (xgen-mm-phi3-mini-instruct-r-v1) demo and API

Note that this is an unofficial implementation of BLIP-3 (previously released as blip3-phi3-mini-base-r-v1) and is not associated with Salesforce.

Usage

BLIP-3 is a model that answers questions about images. To use it, provide an image, and then ask a question about that image. For example, you can provide the following image:

Marina Bay Sands

and then pose the following question:

What is this a picture of?

and get the output:

Marina Bay Sands, Singapore.
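
Via the API, the same question can be asked with the Replicate Python client, as in the minimal sketch below. The question input name is an assumption based on the input descriptions on this page; consult the model's API schema for the authoritative field names.

    # Minimal sketch using the Replicate Python client (pip install replicate).
    # The "question" input name is an assumption based on this page.
    import replicate

    output = replicate.run(
        "zsxkib/blip-3",  # optionally pin a specific version, e.g. "zsxkib/blip-3:<version>"
        input={
            "image": open("marina_bay_sands.jpg", "rb"),
            "question": "What is this a picture of?",
        },
    )
    print(output)  # e.g. "Marina Bay Sands, Singapore."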

BLIP-3 is also capable of captioning images. This works by sending the model a blank prompt, though we have an explicit toggle for image captioning in the UI & API.
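
In the API, enabling that toggle might look like the sketch below; the image_captioning flag name is an assumption rather than the confirmed input name.

    # Captioning sketch; "image_captioning" is an assumed name for the toggle.
    import replicate

    caption = replicate.run(
        "zsxkib/blip-3",
        input={
            "image": open("marina_bay_sands.jpg", "rb"),
            "image_captioning": True,  # assumed captioning toggle
        },
    )
    print(caption)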

You can also provide BLIP-3 with more context when asking a question. For example, given the following image:

Panda

you can provide the output of a previous Q&A as context in question: … answer: … format like so:

question: What animal is this? answer: A panda

and then pose an additional question:

What country is this animal native to?

and get the output:

China
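
Through the API, the earlier answer can be passed back in the same way, as sketched below; the context and question input names are assumptions based on the descriptions on this page.

    # Follow-up question sketch that reuses a previous Q&A as context.
    # The "context" and "question" input names are assumptions; verify them
    # against the model's API schema before relying on them.
    import replicate

    output = replicate.run(
        "zsxkib/blip-3",
        input={
            "image": open("panda.jpg", "rb"),
            "context": "question: What animal is this? answer: A panda",
            "question": "What country is this animal native to?",
        },
    )
    print(output)  # e.g. "China"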

Model description

XGen-MM (previously known as BLIP-3) is a series of the latest foundational Large Multimodal Models (LMMs) developed by Salesforce AI Research. This series advances upon the successful designs of the BLIP series, incorporating fundamental enhancements that ensure a more robust and superior foundation.

Key features of XGen-MM:

  • The pretrained foundation model, xgen-mm-phi3-mini-base-r-v1, achieves state-of-the-art performance under 5b parameters and demonstrates strong in-context learning capabilities.
  • The instruct fine-tuned model, xgen-mm-phi3-mini-instruct-r-v1, achieves state-of-the-art performance among open-source and closed-source VLMs under 5b parameters.
  • xgen-mm-phi3-mini-instruct-r-v1 supports flexible high-resolution image encoding with efficient visual token sampling.

These models have been trained at scale on high-quality image caption datasets and interleaved image-text data.

Citation

@misc{xgen_mm_phi3_mini,
    title={xgen-mm-phi3-mini-instruct Model Card},
    url={https://huggingface.co/Salesforce/xgen-mm-phi3-mini-instruct-r-v1},
    author={Salesforce AI Research},
    month={May},
    year={2024}
}