adirik / bunny-phi-2-siglip

Lightweight multimodal model for visual question answering, reasoning and captioning

  • Public
  • 3.9K runs

Run time and cost

This model costs approximately $0.0013 per run on Replicate, or about 769 runs per $1, though the exact cost varies with your inputs. It is also open source, so you can run it on your own computer with Docker.

This model runs on Nvidia A40 GPU hardware. Predictions typically complete within 3 seconds.

Readme

Bunny

A lightweight multimodal large language model built on SigLIP and Phi-2, developed by BAAI-DCAI. See the original repo, technical report and official demo for details.

How to use the API

The API input arguments are as follows:

image: Path to the input image to query.
prompt: Text prompt used to caption or query the input image.
temperature: Controls the randomness of outputs; values above 1 make sampling more random, while 0 makes it deterministic.
top_p: During decoding, samples from the most likely tokens whose cumulative probability reaches top_p.
max_new_tokens: Maximum number of tokens to generate. A word is generally 2-3 tokens.
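As a rough sketch, the arguments above can be passed to the model through the Replicate Python client. The helper function, its default values, and the local file name below are illustrative assumptions, not documented defaults; the call itself requires `pip install replicate` and a `REPLICATE_API_TOKEN` environment variable.

```python
# Sketch of querying the model via the Replicate Python client (assumed usage).

def build_input(image, prompt, temperature=0.2, top_p=1.0, max_new_tokens=256):
    """Assemble the prediction payload. Default values here are
    illustrative assumptions, not the model's documented defaults."""
    return {
        "image": image,              # file handle or URL of the image to query
        "prompt": prompt,            # caption or question text
        "temperature": temperature,  # 0 = deterministic, >1 = more random
        "top_p": top_p,              # nucleus-sampling cutoff
        "max_new_tokens": max_new_tokens,
    }

if __name__ == "__main__":
    import replicate  # third-party client, assumed installed

    with open("photo.jpg", "rb") as image:
        output = replicate.run(
            "adirik/bunny-phi-2-siglip",
            input=build_input(image, "What is happening in this image?"),
        )
    print(output)
```

The payload builder is kept separate from the network call so the same dictionary could also be posted to Replicate's HTTP API directly.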

References

@article{he2024bunny,
  title={Efficient Multimodal Learning from Data-centric Perspective},
  author={He, Muyang and Liu, Yexin and Wu, Boya and Yuan, Jianhao and Wang, Yueze and Huang, Tiejun and Zhao, Bo},
  journal={arXiv preprint arXiv:2402.11530},
  year={2024}
}