gfodor/instructblip

Image captioning via vision-language models with instruction tuning

Run time and cost

Predictions run on Nvidia A100 (40GB) GPU hardware and typically complete within 7 seconds, though prediction time varies significantly with the inputs.

InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

project page paper

InstructBLIP is an instruction-tuned image captioning model.

Comparison

From the project page:

"The response from InstructBLIP is more comprehensive than GPT-4, more visually-grounded than LLaVA, and more logical than MiniGPT-4. The responses of GPT-4 and LLaVA are obtained from their respective papers, while the official demo is used for MiniGPT-4."