jimothyjohn / phi3-vision-instruct

A soon-to-be-accelerated endpoint for multimodal inference.


Run time and cost

This model runs on Nvidia A100 (80GB) GPU hardware. We don't yet have enough runs of this model to provide performance information.

Readme

Phi-3.5-Vision-Instruct

Phi-3.5-vision is a lightweight, state-of-the-art open multimodal model built on datasets that include synthetic data and filtered, publicly available websites, with a focus on very high-quality, reasoning-dense data across both text and vision. The model belongs to the Phi-3 model family, and this multimodal version supports a 128K-token context length.
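
Once the endpoint is live, calling it should look roughly like the sketch below, which uses the Replicate Python client. The input field names (`image`, `prompt`) are assumptions, since this model's input schema isn't documented here; check the model page for the actual schema, and set the `REPLICATE_API_TOKEN` environment variable before running.

```python
import replicate

# Minimal sketch: run the model through the Replicate Python client.
# The "image" and "prompt" input names below are assumptions, not the
# confirmed schema for this endpoint.
output = replicate.run(
    "jimothyjohn/phi3-vision-instruct",
    input={
        "image": open("photo.jpg", "rb"),  # local image file
        "prompt": "Describe this image in one sentence.",
    },
)
print(output)
```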

TODO

Notes

99% of the code in this project was generated by ChatGPT's o1-preview. I wanted to see whether example files and explicit instructions would be enough for it to build this out autonomously, and I was very impressed by the result. My involvement was limited to the initial prompt and troubleshooting error outputs.