adirik / vila-2.7b

[Non-commercial] A multi-image visual language model

Run time and cost

This model runs on Nvidia A40 GPU hardware.

Readme

VILA

VILA is a visual language model (VLM) pretrained with interleaved image-text data. See the paper and official repo for details.

How to use the API

To use VILA, provide an image and a text prompt. The model generates a response to the query grounded in the image, decoding its output with beam search using the parameters listed below (see the example request after the list).

  • image: The image to discuss.
  • prompt: The query to generate a response for.
  • top_p: When decoding text, samples only from the top p fraction of probability mass (nucleus sampling); lower values ignore less likely tokens.
  • temperature: When decoding text, higher values make the model more creative.
  • num_beams: Number of beams to use when decoding text; higher values are slower but more accurate.
  • max_tokens: Maximum number of tokens to generate.
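
Below is a minimal sketch of calling this model with the Replicate Python client. The model name is taken from this page, but the version pin, file path, prompt, and parameter values are illustrative assumptions; adjust them for your use case.

import replicate

# Hypothetical example request; in practice, pin a specific model version.
output = replicate.run(
    "adirik/vila-2.7b",
    input={
        "image": open("photo.jpg", "rb"),   # the image to discuss (example path)
        "prompt": "What is happening in this image?",
        "top_p": 0.9,        # nucleus sampling threshold (assumed value)
        "temperature": 0.2,  # higher values are more creative (assumed value)
        "num_beams": 1,      # beam search width (assumed value)
        "max_tokens": 256,   # cap on generated tokens (assumed value)
    },
)
print(output)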

References

@misc{lin2023vila,
      title={VILA: On Pre-training for Visual Language Models},
      author={Ji Lin and Hongxu Yin and Wei Ping and Yao Lu and Pavlo Molchanov and Andrew Tao and Huizi Mao and Jan Kautz and Mohammad Shoeybi and Song Han},
      year={2023},
      eprint={2312.07533},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}