gfodor / instructblip

Image captioning via vision-language models with instruction tuning

  • Public
  • 522.2K runs

Run time and cost

This model runs on Nvidia A100 (40GB) GPU hardware. Predictions typically complete within 75 seconds, though predict time varies significantly with the inputs.

Readme

InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

Project page · Paper

InstructBLIP is an instruction-tuned image captioning model.
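A minimal sketch of calling this model through Replicate's Python client. The input field names (`image_path`, `prompt`) are assumptions based on typical image-captioning models on Replicate, not confirmed from this model's schema; check the model's API tab for the actual parameters. Running it requires a `REPLICATE_API_TOKEN` in the environment.

```python
def build_input(image_url: str, prompt: str) -> dict:
    """Assemble the prediction payload. Field names are assumptions;
    verify against the model's published input schema."""
    return {"image_path": image_url, "prompt": prompt}


def caption(image_url: str, prompt: str = "Describe this image in detail.") -> str:
    """Run the model via Replicate's client (network call; needs an API token)."""
    import replicate  # third-party client: pip install replicate

    # Hypothetical invocation; pin a specific version hash in real use.
    return replicate.run("gfodor/instructblip", input=build_input(image_url, prompt))
```

Pinning an explicit version hash (`gfodor/instructblip:<hash>`) is generally preferable, since model updates can change output format.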

Comparison

From the project page:

“The response from InstructBLIP is more comprehensive than GPT-4, more visually-grounded than LLaVA, and more logical than MiniGPT-4. The responses of GPT-4 and LLaVA are obtained from their respective papers, while the official demo is used for MiniGPT-4.”