VILA
VILA is a visual language model (VLM) pretrained with interleaved image-text data. See the paper and official repo for details.
How to use the API
To use VILA, provide an image and a text prompt. The model generates a response to the prompt conditioned on the image. Decoding is controlled by the parameters below: top_p and temperature govern sampling, and num_beams enables beam search when set above 1. A usage sketch follows the parameter list.
- image: The image to discuss.
- prompt: The query to generate a response for.
- top_p: When decoding text, samples only from the smallest set of tokens whose cumulative probability reaches p; lower values ignore less likely tokens.
- temperature: When decoding text, higher values produce more varied, creative output; lower values make the output more deterministic.
- num_beams: Number of beams to use when decoding text; higher values are slower but can yield higher-quality responses.
- max_tokens: Maximum number of tokens to generate.
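
The sketch below is an illustration of how these parameters might be passed in a single prediction request. The endpoint URL, the `API_TOKEN` environment variable, the base64 image encoding, and the response shape are assumptions for the sake of the example, not the official interface; consult the official repo for the exact API.

```python
# Hypothetical usage sketch: sends one VILA prediction request over HTTP.
# The endpoint URL, auth header, and response format are assumptions.
import base64
import os

import requests

API_URL = "https://example.com/v1/vila/predict"  # placeholder endpoint


def query_vila(image_path: str, prompt: str) -> str:
    # Encode the local image as base64 so it can travel in a JSON body.
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    payload = {
        "image": image_b64,
        "prompt": prompt,
        "top_p": 0.9,        # sample from the top 90% of probability mass
        "temperature": 0.2,  # low temperature for mostly deterministic output
        "num_beams": 1,      # 1 disables beam search; raise to enable it
        "max_tokens": 256,   # cap on the number of generated tokens
    }
    headers = {"Authorization": f"Bearer {os.environ['API_TOKEN']}"}

    response = requests.post(API_URL, json=payload, headers=headers, timeout=120)
    response.raise_for_status()
    return response.json()["output"]


print(query_vila("photo.jpg", "What is happening in this image?"))
```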
References
@misc{lin2023vila,
      title={VILA: On Pre-training for Visual Language Models},
      author={Ji Lin and Hongxu Yin and Wei Ping and Yao Lu and Pavlo Molchanov and Andrew Tao and Huizi Mao and Jan Kautz and Mohammad Shoeybi and Song Han},
      year={2023},
      eprint={2312.07533},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}