cjwbw/uform-gen2-qwen-500m | Readme and Docs

Description

UForm-Gen is a small generative vision-language model primarily designed for Image Captioning and Visual Question Answering. The model consists of two parts:

CLIP-like ViT-H/14 (vision encoder)
Qwen1.5-0.5B-Chat (language model)

Evaluation

Model	LLM Size	SQA	MME	MMBench	Average¹
UForm-Gen2-Qwen-500m	0.5B	45.5	880.1	42.0	29.31
MobileVLM v2	1.4B	52.1	1302.8	57.7	36.81
LLaVA-Phi	2.7B	68.4	1335.1	59.8	42.95

¹MME scores were divided by 2000 before averaging.
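
The averaging rule in the footnote can be checked with a short sketch. It assumes the Average column is the plain mean of SQA, MME / 2000, and MMBench; the names `SCORES`, `MME_DIVISOR`, and `table_average` are illustrative and do not come from the model's code. Under that assumption the computed values agree with the table to within 0.01 (the reported figures look truncated rather than rounded to two decimals).

```python
# Scores copied from the evaluation table: (SQA, MME, MMBench, reported average).
SCORES = {
    "UForm-Gen2-Qwen-500m": (45.5, 880.1, 42.0, 29.31),
    "MobileVLM v2": (52.1, 1302.8, 57.7, 36.81),
    "LLaVA-Phi": (68.4, 1335.1, 59.8, 42.95),
}

MME_DIVISOR = 2000  # taken from the footnote


def table_average(sqa, mme, mmbench):
    """Mean of the three benchmark scores, with MME divided by 2000 first."""
    return (sqa + mme / MME_DIVISOR + mmbench) / 3


for name, (sqa, mme, mmbench, reported) in SCORES.items():
    computed = table_average(sqa, mme, mmbench)
    print(f"{name}: computed {computed:.3f}, reported {reported}")
```

Because MME is divided by 2000 but not rescaled to a percentage, it contributes less than 1 point to each sum, so the Average column is driven almost entirely by SQA and MMBench.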

Model created over 1 year ago