adirik / vila-7b

[Non-commercial] A multi-image visual language model

  • Public
  • 2.1K runs
  • GitHub
  • Paper
  • License

Run time and cost

This model costs approximately $0.0094 to run on Replicate, or 106 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.
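If you run it locally with Docker, the published image exposes the standard Cog HTTP interface. The sketch below is illustrative only: it assumes the container was built from this repository (or pulled from Replicate's registry at r8.im/adirik/vila-7b), is already running with the default Cog port (5000) published, and accepts the image input as a URL.

```python
import requests

# Assumes the model container is already running locally, e.g. started with
# something like: docker run -p 5000:5000 --gpus all r8.im/adirik/vila-7b
# (the exact image tag/version and GPU flags depend on your setup).
resp = requests.post(
    "http://localhost:5000/predictions",
    json={
        "input": {
            "image": "https://example.com/cat.jpg",  # placeholder image URL
            "prompt": "Describe this image.",
        }
    },
)
resp.raise_for_status()
print(resp.json().get("output"))
```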

This model runs on Nvidia L40S GPU hardware. Predictions typically complete within 10 seconds.

Readme

VILA

VILA is a visual language model (VLM) pretrained with interleaved image-text data. See the paper and official repo for details.

How to use the API

To use VILA, provide an image and a text prompt. The model generates a response to the prompt, conditioned on the image, by decoding its output with beam search using the parameters below (see the example after the parameter list).

  • image: The image to discuss.
  • prompt: The query to generate a response for.
  • top_p: Nucleus sampling threshold; when decoding text, samples only from the smallest set of most likely tokens whose cumulative probability reaches p. Lower values ignore less likely tokens.
  • temperature: When decoding text, higher values make the model more creative.
  • num_beams: Number of beams to use when decoding text; higher values are slower but more accurate.
  • max_tokens: Maximum number of tokens to generate.
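
As a concrete illustration, here is a minimal sketch of calling the model through the Replicate Python client. The input keys mirror the parameters listed above; the image URL and parameter values are placeholders, and you may want to pin a specific model version for reproducibility.

```python
import replicate

# Requires the REPLICATE_API_TOKEN environment variable to be set.
output = replicate.run(
    "adirik/vila-7b",  # optionally pin a version: "adirik/vila-7b:<version-id>"
    input={
        "image": "https://example.com/cat.jpg",  # placeholder image URL
        "prompt": "What is the animal in this picture doing?",
        "top_p": 0.9,
        "temperature": 0.2,
        "num_beams": 1,
        "max_tokens": 256,
    },
)
print(output)
```

Whether the output arrives as a single string or as streamed chunks depends on how the model's output type is declared, so inspect it before post-processing.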

References

@misc{lin2023vila,
      title={VILA: On Pre-training for Visual Language Models},
      author={Ji Lin and Hongxu Yin and Wei Ping and Yao Lu and Pavlo Molchanov and Andrew Tao and Huizi Mao and Jan Kautz and Mohammad Shoeybi and Song Han},
      year={2023},
      eprint={2312.07533},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}