VILA
VILA is a visual language model (VLM) pretrained with interleaved image-text data. See the paper and official repo for details.
How to use the API
To use VILA, provide an image and a text prompt. The model generates a response to the prompt conditioned on the image. Decoding is controlled by the parameters below: top_p and temperature govern sampling, and num_beams enables beam search when set above 1. A usage sketch follows the parameter list.
- image: The image to discuss.
- prompt: The query to generate a response for.
- top_p: When decoding text, samples only from the smallest set of tokens whose cumulative probability reaches p; lower values ignore less likely tokens.
- temperature: When decoding text, higher values produce more varied, creative output; lower values make the output more deterministic.
- num_beams: Number of beams to use when decoding text; higher values are slower but can yield higher-quality responses.
- max_tokens: Maximum number of tokens to generate.
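
The sketch below is an illustration of how these parameters might be passed in a single prediction request. The endpoint URL, the `API_TOKEN` environment variable, the base64 image encoding, and the response shape are assumptions for the sake of the example, not the official interface; consult the official repo for the exact API.

```python
# Hypothetical usage sketch: sends one VILA prediction request over HTTP.
# The endpoint URL, auth header, and response format are assumptions.
import base64
import os

import requests

API_URL = "https://example.com/v1/vila/predict"  # placeholder endpoint


def query_vila(image_path: str, prompt: str) -> str:
    # Encode the local image as base64 so it can travel in a JSON body.
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    payload = {
        "image": image_b64,
        "prompt": prompt,
        "top_p": 0.9,        # sample from the top 90% of probability mass
        "temperature": 0.2,  # low temperature for mostly deterministic output
        "num_beams": 1,      # 1 disables beam search; raise to enable it
        "max_tokens": 256,   # cap on the number of generated tokens
    }
    headers = {"Authorization": f"Bearer {os.environ['API_TOKEN']}"}

    response = requests.post(API_URL, json=payload, headers=headers, timeout=120)
    response.raise_for_status()
    return response.json()["output"]


print(query_vila("photo.jpg", "What is happening in this image?"))
```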
References
@misc{lin2023vila,
      title={VILA: On Pre-training for Visual Language Models},
      author={Ji Lin and Hongxu Yin and Wei Ping and Yao Lu and Pavlo Molchanov and Andrew Tao and Huizi Mao and Jan Kautz and Mohammad Shoeybi and Song Han},
      year={2023},
      eprint={2312.07533},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}