An instruction-tuned multi-modal model based on BLIP-2 and Vicuna-13B

  • Public
  • 251.3K runs

Run time and cost

This model runs on Nvidia A100 (40GB) GPU hardware. Predictions typically complete within 82 seconds. The predict time for this model varies significantly based on the inputs.



This model generates text conditioned on both text and image prompts. Unlike standard multi-modal models, it has also been fine-tuned to follow human instructions.

This model was developed by Salesforce; this implementation is an unofficial version based on their open-source code.

Model description

This model was developed by applying a multi-modal instruction-tuning framework, InstructBLIP, to a pre-trained BLIP-2 model. Here, the BLIP-2 model is composed of Vicuna-13B (the LLM), an image encoder, and a Q-Former, which bridges the LLM and the image encoder. To instruction-tune the model, the LLM and image-encoder weights were frozen and only the Q-Former weights were updated.
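The selective-freezing setup above can be sketched in PyTorch. This is a minimal illustration with tiny stand-in modules, not the actual InstructBLIP code; the real components are Vicuna-13B, a ViT image encoder, and the Q-Former.

```python
import torch.nn as nn

class InstructBlipSketch(nn.Module):
    """Toy stand-in for the three-component BLIP-2 architecture."""
    def __init__(self):
        super().__init__()
        # Illustrative placeholders; the real modules are far larger.
        self.image_encoder = nn.Linear(16, 8)  # frozen during tuning
        self.llm = nn.Linear(8, 8)             # frozen during tuning
        self.q_former = nn.Linear(8, 8)        # the only trainable part

def freeze_all_but_qformer(model: nn.Module) -> list:
    """Freeze the LLM and image encoder; keep only Q-Former weights trainable."""
    for p in model.parameters():
        p.requires_grad = False
    for p in model.q_former.parameters():
        p.requires_grad = True
    # Return only the parameters an optimizer should update.
    return [p for p in model.parameters() if p.requires_grad]

model = InstructBlipSketch()
trainable = freeze_all_but_qformer(model)
```

In practice the returned list is what you would pass to the optimizer, so gradient updates never touch the frozen components.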

To implement this model for Replicate, we introduced several modifications to the original code. To minimize the time it takes to initialize the model on inference instances, we tensorized the Vicuna-13B weights, and the weights for each component of the model are downloaded and loaded in parallel. We also modified the generation method to support token streaming.
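A hedged sketch of the parallel-loading idea, using Python's standard `concurrent.futures`. The component names and `load_component` helper are illustrative, not from the actual implementation; the point is that overlapping I/O-bound downloads brings startup time closer to the slowest single component rather than the sum of all of them.

```python
from concurrent.futures import ThreadPoolExecutor

def load_component(name: str) -> str:
    # Placeholder for downloading and deserializing one component's
    # weights; the real code loads tensorized Vicuna-13B weights here.
    return f"{name}: loaded"

def load_all(components: list[str]) -> list[str]:
    # One worker per component so all downloads proceed concurrently.
    with ThreadPoolExecutor(max_workers=len(components)) as pool:
        return list(pool.map(load_component, components))

def stream_tokens(tokens):
    # Token streaming: yield each token as soon as it is generated
    # instead of waiting for the full completion.
    yield from tokens

results = load_all(["vicuna-13b", "image_encoder", "q_former"])
```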

Intended use

The model is intended and licensed for research use only. InstructBLIP models with Vicuna are restricted to uses that follow the license agreements of LLaMA and Vicuna. The models were trained on the LLaVA dataset, which is licensed CC BY-NC 4.0 (non-commercial use only).