InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
InstructBLIP is an instruction-tuned vision-language model; beyond image captioning, it follows natural-language instructions to answer questions and reason about images.
From the project page:
“The response from InstructBLIP is more comprehensive than GPT-4, more visually-grounded than LLaVA, and more logical than MiniGPT-4. The responses of GPT-4 and LLaVA are obtained from their respective papers, while the official demo is used for MiniGPT-4.”