Model description
MiniGPT-4 is a multimodal model that lets users prompt a language model with an image and some text. Users can ask questions about images, generate HTML from website mockups, write advertisements for fictional products, and more. The model can function as a chatbot with longer back-and-forth conversations, though this implementation is a simple question-and-answer model.
MiniGPT-4 consists of a frozen vision encoder (a pretrained ViT with a Q-Former), a single linear projection layer, and a frozen Vicuna large language model. Only the linear projection layer is trained; its job is to align the visual features with Vicuna.
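The data flow described above can be illustrated with a toy, dependency-free sketch. It is not the MiniGPT-4 implementation: `frozen_visual_encoder`, `LinearProjection`, `build_llm_inputs`, and all dimensions are hypothetical stand-ins chosen only to show that visual tokens from a frozen encoder pass through one trainable linear layer before being placed alongside the text embeddings that the frozen language model consumes. Putting the visual tokens before the text tokens is also an assumption of this sketch.

```python
import random


class LinearProjection:
    """Toy linear layer (y = W x + b). In this sketch it is the only trainable part."""

    def __init__(self, in_dim, out_dim, seed=0):
        rng = random.Random(seed)
        # weight has shape (out_dim, in_dim); bias has shape (out_dim,)
        self.weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
                       for _ in range(out_dim)]
        self.bias = [0.0] * out_dim

    def __call__(self, tokens):
        # Apply the same projection to every token vector.
        return [
            [sum(w * x for w, x in zip(row, token)) + b
             for row, b in zip(self.weight, self.bias)]
            for token in tokens
        ]

    def num_parameters(self):
        return sum(len(row) for row in self.weight) + len(self.bias)


def frozen_visual_encoder(image, num_query_tokens=4, feat_dim=8):
    """Hypothetical stand-in for the frozen ViT + Q-Former.

    Returns a fixed number of feature vectors; nothing here is trained.
    """
    mean = sum(image) / len(image)
    return [[mean * (i + 1) / (j + 1) for j in range(feat_dim)]
            for i in range(num_query_tokens)]


def build_llm_inputs(image, text_embeddings, projection):
    """Project visual tokens into the LLM embedding size and prepend them to the text."""
    visual_tokens = frozen_visual_encoder(image)
    projected = projection(visual_tokens)
    return projected + text_embeddings


# Usage: toy sizes (8-dim visual features, 16-dim LLM embeddings), assumed for illustration.
projection = LinearProjection(in_dim=8, out_dim=16)
image = [0.1, 0.5, 0.9]                          # placeholder "pixels"
text_embeddings = [[0.0] * 16 for _ in range(3)]  # placeholder text token embeddings
llm_inputs = build_llm_inputs(image, text_embeddings, projection)
print(len(llm_inputs), len(llm_inputs[0]), projection.num_parameters())
```

In this sketch the projection layer owns every trainable parameter, which mirrors the claim that only the linear layer needs training while the encoder and the language model stay frozen.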
Intended use
MiniGPT-4 is useful for various applications that require image understanding, including:
- Describing an image and its context (for instance,
- Writing stories about images of characters
- Describing recipes from images of food
- Etc.
Citation
@article{zhu2023minigpt,
  title={MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models},
  author={Zhu, Deyao and Chen, Jun and Shen, Xiaoqian and Li, Xiang and Elhoseiny, Mohamed},
  journal={arXiv preprint arXiv:2304.10592},
  year={2023}
}
