daanelson / minigpt-4

A model which generates text in response to an input image and prompt.

  • Public
  • 1.8M runs
  • A100 (80GB)
  • GitHub
  • Paper
  • License

Input

image (file, required)
Image to discuss

prompt (string, required)
Prompt for MiniGPT-4 regarding the input image

num_beams (integer, minimum: 1, maximum: 10, default: 3)
Number of beams for beam search decoding

temperature (number, minimum: 0.01, maximum: 2, default: 1)
Temperature for generating tokens; lower = more predictable results

top_p (number, minimum: 0, maximum: 1, default: 0.9)
Nucleus sampling: sample from the smallest set of most-likely tokens whose cumulative probability reaches p

repetition_penalty (number, minimum: 0.01, maximum: 5, default: 1)
Penalty for repeated words in generated text; 1 is no penalty, values greater than 1 discourage repetition, values less than 1 encourage it

max_new_tokens (integer, minimum: 1, default: 3000)
Maximum number of new tokens to generate

max_length (integer, minimum: 1, default: 4000)
Total length of prompt and output in tokens
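A minimal sketch of calling the model with the Replicate Python client. The input keys below are assumed from the parameter descriptions above, and <version> is a placeholder for a pinned version hash (such as the one shown under Output):

# pip install replicate; set REPLICATE_API_TOKEN in your environment.
import replicate

output = replicate.run(
    "daanelson/minigpt-4:<version>",  # replace <version> with a real hash
    input={
        "image": open("photo.jpg", "rb"),  # image to discuss
        "prompt": "Write a short story about this image.",
        "num_beams": 3,             # beam search width (1-10)
        "temperature": 1.0,         # lower = more predictable
        "top_p": 0.9,               # nucleus sampling threshold
        "repetition_penalty": 1.0,  # >1 discourages repetition
        "max_new_tokens": 3000,
        "max_length": 4000,         # prompt + output, in tokens
    },
)
print(output)  # the generated text, per the output example below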

Output

Dave the llama was feeling very bored one day. He had been wandering around the city for hours, but there was nothing interesting to do. Suddenly, he saw a skateboard lying on the ground. He decided to try it out, and as soon as he started riding it, he felt a rush of excitement. He rode around the city, enjoying the feeling of the wind in his hair and the freedom of being on his own. As he rode, he saw all sorts of interesting things that he had never noticed before. He even met some new friends along the way. After a while, Dave realized that he had found his true passion - skateboarding. From then on, he spent all his free time riding his skateboard and exploring the city.

This output was created using a different version of the model, daanelson/minigpt-4:b96a2f33.

Run time and cost

This model costs approximately $0.0057 to run on Replicate, or 175 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.

This model runs on Nvidia A100 (80GB) GPU hardware. Predictions typically complete within 5 seconds.
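As a quick arithmetic check on those figures (a sketch; the implied per-second rate is back-derived from the numbers above, not an official price):

cost_per_run = 0.0057                  # USD, quoted above
print(round(1 / cost_per_run))         # ~175 runs per $1, matching the page
seconds_per_run = 5                    # typical prediction time quoted above
print(cost_per_run / seconds_per_run)  # implied GPU rate ~ $0.00114/s (assumption)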

Readme

Model description

MiniGPT-4 is a multimodal model which allows users to prompt a language model with an image and some text. This enables users to ask questions about images, generate HTML from website mockups, write advertisements for fictional products, and more. It can function as a chatbot with longer back-and-forth conversations, though this implementation is a simple question-and-answer model.

MiniGPT-4 consists of a frozen vision encoder (a pretrained ViT with a Q-Former), a single linear projection layer, and a frozen Vicuna large language model. Only the linear projection layer needs to be trained, to align the visual features with Vicuna.
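A minimal PyTorch sketch of that training setup. The dimensions are hypothetical stand-ins (Q-Former feature width 768, Vicuna embedding width 4096), and the frozen vision encoder is mocked with a single linear layer; only the projection layer receives gradients:

import torch
import torch.nn as nn

vision_encoder = nn.Linear(3 * 224 * 224, 768)  # mock for the frozen ViT + Q-Former
projection = nn.Linear(768, 4096)               # the only module that is trained

for p in vision_encoder.parameters():           # freeze the vision encoder
    p.requires_grad = False
# (The Vicuna LLM, not shown here, is likewise frozen.)

optimizer = torch.optim.AdamW(projection.parameters(), lr=1e-4)

image = torch.randn(1, 3 * 224 * 224)           # a fake flattened image
with torch.no_grad():
    visual_feats = vision_encoder(image)        # (1, 768) frozen features
visual_tokens = projection(visual_feats)        # (1, 4096): aligned to Vicuna's
                                                # embedding space, prepended to the
                                                # text embeddings at generation time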

Intended use

MiniGPT-4 is useful for various applications that require image understanding, including:

  • Describing an image and its context
  • Writing stories about images of characters
  • Describing recipes from images of food
  • Etc.

Citation

@article{zhu2023minigpt,
  title={MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models},
  author={Zhu, Deyao and Chen, Jun and Shen, Xiaoqian and Li, Xiang and Elhoseiny, Mohamed},
  journal={arXiv preprint arXiv:2304.10592},
  year={2023}
}