Bunny
A lightweight multimodal large language model from BAAI-DCAI, built on SigLIP and Phi-2. See the original repo, technical report and official demo for details.
How to use the API
The API input arguments are as follows:
image: Path to input image to be queried.
prompt: Text prompt to caption or query the input image with.
temperature: Adjusts the randomness of outputs. 0 is deterministic; higher values give more varied output, and values above 1 are increasingly random.
top_p: Nucleus sampling threshold. During decoding, tokens are sampled only from the smallest set of most likely tokens whose cumulative probability reaches top_p.
max_new_tokens: Maximum number of tokens to generate. A word generally corresponds to one or more tokens, so the limit measured in words is lower than the limit in tokens.
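To make the temperature and top_p arguments more concrete, here is a minimal pure-Python sketch of how these two settings are commonly applied when picking the next token. It is an illustration only, not Bunny's actual decoding code: the function names (apply_temperature, top_p_filter, sample_token) and the toy logits are hypothetical, and treating temperature 0 as greedy decoding is an assumption based on the "0 is deterministic" description above.

```python
import math
import random


def apply_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing by temperature."""
    if temperature <= 0:
        # Assumption: temperature 0 means greedy decoding, so all
        # probability mass goes to the highest-scoring token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose cumulative
    probability reaches top_p, then renormalize over that set."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Sample one token index using temperature scaling and top_p filtering."""
    probs = apply_temperature(logits, temperature)
    candidates = top_p_filter(probs, top_p)
    tokens = list(candidates)
    weights = [candidates[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.1]  # hypothetical scores for three tokens
    print(sample_token(toy_logits, temperature=0.0))
    print(sample_token(toy_logits, temperature=1.0, top_p=0.9))
```

In this sketch, lowering temperature concentrates probability on the top token, while lowering top_p removes unlikely tokens from consideration entirely; the two settings can be combined.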
References
@article{he2024bunny,
title={Efficient Multimodal Learning from Data-centric Perspective},
author={He, Muyang and Liu, Yexin and Wu, Boya and Yuan, Jianhao and Wang, Yueze and Huang, Tiejun and Zhao, Bo},
journal={arXiv preprint arXiv:2402.11530},
year={2024}
}