daanelson / minigpt-4

A model which generates text in response to an input image and prompt.

  • Public
  • 1.3M runs
  • GitHub
  • Paper
  • License



Run time and cost

This model runs on Nvidia A100 (40GB) GPU hardware. Predictions typically complete within 5 seconds. The predict time for this model varies significantly based on the inputs.


Model description

MiniGPT-4 is a multimodal model which allows users to prompt a language model with an image and some text. This enables users to ask questions about images, generate HTML from website mockups, write advertisements for fictional products, and more. It can function as a chatbot with longer back and forth conversations, though this implementation is a simple question and answer model.

MiniGPT-4 consists of a frozen vision encoder with a pretrained ViT and Q-Former, a single linear projection layer, and a frozen Vicuna large language model. MiniGPT-4 only requires training the linear layer to align the visual features with the Vicuna.

Intended use

MiniGPT-4 is useful for various applications that require image understanding, including: - Describing an image and its context (for instance, - Writing stories about images of characters - Describing recipes from images of food - Etc.


  title={MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models},
  author={Zhu, Deyao and Chen, Jun and Shen, Xiaoqian and Li, Xiang and Elhoseiny, Mohamed},
  journal={arXiv preprint arXiv:2304.10592},