Description
UForm-Gen is a small generative vision-language model designed primarily for image captioning and visual question answering. The model consists of two parts:
- CLIP-like ViT-H/14 (vision encoder)
- Qwen1.5-0.5B-Chat (language model)
Evaluation
| Model | LLM Size | SQA | MME | MMBench | Average¹ |
|---|---|---|---|---|---|
| UForm-Gen2-Qwen-500m | 0.5B | 45.5 | 880.1 | 42.0 | 29.31 |
| MobileVLM v2 | 1.4B | 52.1 | 1302.8 | 57.7 | 36.81 |
| LLaVA-Phi | 2.7B | 68.4 | 1335.1 | 59.8 | 42.95 |
¹ MME scores were divided by 2000 before being averaged with the SQA and MMBench scores.
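
The Average column can be reproduced from the other three columns. The sketch below is not the authors' evaluation code. It assumes an unweighted mean of SQA, MME / 2000, and MMBench, truncated (not rounded) to two decimals, which is the convention that matches the values in the table. The names `SCORES`, `MME_DIVISOR`, and `average_score` are illustrative.

```python
import math

# Benchmark scores copied from the table above: (SQA, MME, MMBench).
SCORES = {
    "UForm-Gen2-Qwen-500m": (45.5, 880.1, 42.0),
    "MobileVLM v2": (52.1, 1302.8, 57.7),
    "LLaVA-Phi": (68.4, 1335.1, 59.8),
}

MME_DIVISOR = 2000  # from the footnote: MME is divided by 2000 before averaging


def average_score(sqa: float, mme: float, mmbench: float) -> float:
    """Unweighted mean of SQA, MME / 2000, and MMBench.

    Assumption: the result is truncated to two decimals, since truncation
    (rather than rounding) reproduces the table's Average column.
    """
    mean = (sqa + mme / MME_DIVISOR + mmbench) / 3
    return math.floor(mean * 100) / 100


averages = {name: average_score(*scores) for name, scores in SCORES.items()}

for name, avg in averages.items():
    print(f"{name}: {avg:.2f}")
```

Under this reading, the scaled MME term lies between 0 and 1 while SQA and MMBench are on a 0 to 100 scale, so the average is dominated by SQA and MMBench.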