adirik / bunny-phi-2-siglip

Lightweight multimodal model for visual question answering, reasoning and captioning


Run time and cost

This model runs on Nvidia A40 GPU hardware. Predictions typically complete within 3 seconds.

Readme

Bunny

A lightweight multimodal large language model from BAAI-DCAI, built on SigLIP and Phi-2. See the original repository, technical report and official demo for details.

How to use the API

The API input arguments are as follows:

image: Path to input image to be queried.
prompt: Text prompt to caption or query the input image with.
temperature: Adjusts the randomness of outputs; values greater than 1 are more random and 0 is deterministic.
top_p: Samples from the top_p fraction of most likely tokens during decoding.
max_new_tokens: Maximum number of tokens to generate. A word is generally 2-3 tokens.
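
Below is a minimal sketch of calling the model with the Replicate Python client. The image path, prompt, and parameter values are illustrative only, and you should check the model page for the current version to use.

```python
import replicate

# Hypothetical example call; adjust the image path, prompt and
# sampling parameters to your use case.
output = replicate.run(
    "adirik/bunny-phi-2-siglip",
    input={
        "image": open("example.jpg", "rb"),  # local image file to query
        "prompt": "What is happening in this image?",
        "temperature": 0.2,                  # low temperature for more deterministic answers
        "top_p": 0.9,
        "max_new_tokens": 256,
    },
)
print(output)
```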

References

@article{he2024bunny,
      title={Efficient Multimodal Learning from Data-centric Perspective}, 
      author={He, Muyang and Liu, Yexin and Wu, Boya and Yuan, Jianhao and Wang, Yueze and Huang, Tiejun and Zhao, Bo},
      journal={arXiv preprint arXiv:2402.11530},
      year={2024}
}