daanelson / minigpt-4

A model which generates text in response to an input image and prompt.

  • Public
  • 1.4M runs

Run time and cost

This model costs approximately $0.014 to run on Replicate, or 71 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.
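The per-run figure can be turned into a rough budget estimate. The sketch below only restates the arithmetic behind the numbers quoted above; the function names are illustrative, and the actual cost per run varies with the inputs.

```python
# Approximate per-run price quoted on this page; real costs vary with inputs.
COST_PER_RUN_USD = 0.014


def runs_per_dollar(cost_per_run: float) -> int:
    """Whole number of runs that fit in one US dollar."""
    return int(1 / cost_per_run)


def estimated_cost(num_runs: int, cost_per_run: float = COST_PER_RUN_USD) -> float:
    """Rough total cost in US dollars for a batch of runs."""
    return num_runs * cost_per_run


print(runs_per_dollar(COST_PER_RUN_USD))   # 71
print(f"${estimated_cost(1000):.2f}")      # $14.00
```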

This model runs on Nvidia A100 (80GB) GPU hardware. Predictions typically complete within 10 seconds. The predict time for this model varies significantly based on the inputs.

Readme

Model description

MiniGPT-4 is a multimodal model that lets users prompt a language model with an image and some text. This enables users to ask questions about images, generate HTML from website mockups, write advertisements for fictional products, and more. MiniGPT-4 can function as a chatbot with longer back-and-forth conversations, though this implementation is a simple question-and-answer model: one image and one prompt in, one text response out.
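A single question-and-answer call might look like the sketch below, which uses the Replicate Python client's `replicate.run`. The input field names `image` and `prompt` are assumptions based on the description above, not confirmed here, and `<version-id>` is a placeholder that must be replaced with a real version identifier from the model page. The client also expects a `REPLICATE_API_TOKEN` environment variable.

```python
# Placeholder reference: replace <version-id> with a real version from the model page.
MODEL_REF = "daanelson/minigpt-4:<version-id>"


def build_input(image_url: str, prompt: str) -> dict:
    """Assemble the input payload. The keys "image" and "prompt" are assumed names."""
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    return {"image": image_url, "prompt": prompt}


def ask(image_url: str, prompt: str):
    """Send one image plus one prompt and return the model's output."""
    import replicate  # third-party client: pip install replicate

    return replicate.run(MODEL_REF, input=build_input(image_url, prompt))


# Example usage (needs network access, an API token, and a real version id):
# print(ask("https://example.com/mockup.png", "Write HTML for this website mockup."))
```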

MiniGPT-4 consists of a frozen vision encoder (a pretrained ViT and Q-Former), a single linear projection layer, and a frozen Vicuna large language model. Only the linear layer needs to be trained, to align the visual features with Vicuna.
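The data flow described above can be illustrated with a toy sketch in plain Python. This is not the MiniGPT-4 implementation: the vision encoder is a random stand-in, the dimensions are tiny made-up values, and all names are hypothetical. It only shows that the projection layer is the one component carrying trainable parameters, and that it maps each visual token into the language model's embedding size.

```python
import random
from dataclasses import dataclass


@dataclass
class LinearProjection:
    """Toy version of the single trainable layer between Q-Former and LLM."""
    weight: list          # shape (out_dim, in_dim)
    bias: list            # shape (out_dim,)
    trainable: bool = True

    @classmethod
    def init(cls, in_dim: int, out_dim: int, seed: int = 0) -> "LinearProjection":
        rng = random.Random(seed)
        weight = [[rng.gauss(0.0, 0.02) for _ in range(in_dim)] for _ in range(out_dim)]
        return cls(weight=weight, bias=[0.0] * out_dim)

    def __call__(self, tokens: list) -> list:
        # Apply y = W x + b to every visual token independently.
        return [
            [sum(w * x for w, x in zip(row, tok)) + b
             for row, b in zip(self.weight, self.bias)]
            for tok in tokens
        ]

    def num_params(self) -> int:
        return len(self.weight) * len(self.weight[0]) + len(self.bias)


def frozen_vision_encoder(image, num_queries: int, out_dim: int, seed: int = 1) -> list:
    """Stand-in for the frozen ViT + Q-Former; returns random query vectors."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(out_dim)] for _ in range(num_queries)]


# Toy sizes chosen for readability, not the real model dimensions.
NUM_QUERIES, QFORMER_DIM, LLM_DIM = 4, 8, 16

proj = LinearProjection.init(QFORMER_DIM, LLM_DIM)
visual_tokens = frozen_vision_encoder(image=None, num_queries=NUM_QUERIES, out_dim=QFORMER_DIM)
llm_inputs = proj(visual_tokens)  # these would be fed to the frozen LLM with the text prompt

print(len(llm_inputs), len(llm_inputs[0]))  # 4 16
print(proj.num_params())                    # 144
```

Because the encoder and the language model stay frozen, training updates only `proj.weight` and `proj.bias` in this sketch.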

Intended use

MiniGPT-4 is useful for various applications that require image understanding, including, among others:

  • Describing an image and its context
  • Writing stories about images of characters
  • Describing recipes from images of food

Citation

@article{zhu2023minigpt,
  title={MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models},
  author={Zhu, Deyao and Chen, Jun and Shen, Xiaoqian and Li, Xiang and Elhoseiny, Mohamed},
  journal={arXiv preprint arXiv:2304.10592},
  year={2023}
}