jimothyjohn / phi3-vision-instruct

A soon-to-be-accelerated endpoint for multimodal inference.

Phi-3.5-Vision-Instruct

Phi-3.5-vision is a lightweight, state-of-the-art open multimodal model built on datasets that include synthetic data and filtered publicly available websites, with a focus on very high-quality, reasoning-dense data across both text and vision. The model belongs to the Phi-3 model family, and the multimodal version supports a 128K-token context length.
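
For reference, here is a minimal sketch of how one might call the endpoint from Python with the `replicate` client once it is live. The input keys (`image`, `prompt`) are assumptions about the predictor's schema rather than anything this README specifies, and community models may require pinning an explicit version hash.

```python
# A sketch of invoking this endpoint with the Replicate Python client.
# NOTE: the input keys "image" and "prompt" are assumed; check the model's
# actual input schema on Replicate before relying on them.
import replicate

output = replicate.run(
    "jimothyjohn/phi3-vision-instruct",  # may need a ":<version>" suffix
    input={
        "image": open("photo.jpg", "rb"),         # local image to analyze
        "prompt": "What is shown in this image?",
    },
)
print(output)
```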

TODO

Notes

99% of the code in this project was generated by ChatGPT’s o1-preview. I hoped that example files and explicit instructions would be enough for it to build this out autonomously, and I was very impressed by the result. My involvement was limited to the initial prompt and troubleshooting error output.