jimothyjohn / phi3-vision-instruct

A soon-to-be-accelerated endpoint for multimodal inference.

Phi-3.5-Vision-Instruct

Phi-3.5-vision is a lightweight, state-of-the-art open multimodal model built on datasets that include synthetic data and filtered publicly available websites, with a focus on very high-quality, reasoning-dense data across both text and vision. The model belongs to the Phi-3 model family, and the multimodal version supports a 128K-token context length.
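
For reference, here is a minimal sketch of how one might call the endpoint from Python with the `replicate` client once it is live. The input keys (`image`, `prompt`) are assumptions about the predictor's schema rather than anything this README specifies, and community models may require pinning an explicit version hash.

```python
# A sketch of invoking this endpoint with the Replicate Python client.
# NOTE: the input keys "image" and "prompt" are assumed; check the model's
# actual input schema on Replicate before relying on them.
import replicate

output = replicate.run(
    "jimothyjohn/phi3-vision-instruct",  # may need a ":<version>" suffix
    input={
        "image": open("photo.jpg", "rb"),         # local image to analyze
        "prompt": "What is shown in this image?",
    },
)
print(output)
```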

TODO

Notes

99% of the code in this project was generated by ChatGPT’s o1-preview. I hoped that example files and explicit instructions would be enough for it to build this out autonomously, and I was very impressed by the result. My involvement was limited to the initial prompt and troubleshooting error output.