cjwbw/uform-gen2-qwen-500m | Run with an API on Replicate

Pocket-Sized Multimodal AI For Content Understanding and Generation

Run time and cost

This model costs approximately $0.0079 per run on Replicate (about 126 runs per $1), though the cost varies with your inputs. It is also open source, so you can run it on your own computer with Docker.

This model runs on NVIDIA L40S GPU hardware. Predictions typically complete within 9 seconds.
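
As a rough illustration of the pricing arithmetic above, the sketch below derives the runs-per-dollar figure from the quoted per-run cost and estimates the bill for a batch of predictions. The function names and the batch size of 1000 are hypothetical, chosen only for this example, and the per-run cost is the approximate figure quoted on this page, so actual charges may differ.

```python
import math

# Approximate per-run cost quoted on this page (USD); real cost varies with inputs.
COST_PER_RUN_USD = 0.0079


def runs_per_dollar(cost_per_run: float) -> int:
    """Whole number of runs that fit into one US dollar."""
    return math.floor(1.0 / cost_per_run)


def estimated_cost(n_runs: int, cost_per_run: float = COST_PER_RUN_USD) -> float:
    """Estimated total cost in USD for n_runs predictions."""
    return n_runs * cost_per_run


if __name__ == "__main__":
    print("runs per $1:", runs_per_dollar(COST_PER_RUN_USD))
    # Hypothetical batch of 1000 predictions.
    print(f"cost of 1000 runs: ${estimated_cost(1000):.2f}")
```

Flooring the quotient reproduces the "126 runs per $1" figure, since only whole runs can be bought.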

Readme

Description

UForm-Gen is a small generative vision-language model primarily designed for Image Captioning and Visual Question Answering. The model consists of two parts:

CLIP-like ViT-H/14 (vision encoder)
Qwen1.5-0.5B-Chat (language model)

Evaluation

Model	LLM Size	SQA	MME	MMBench	Average¹
UForm-Gen2-Qwen-500m	0.5B	45.5	880.1	42.0	29.31
MobileVLM v2	1.4B	52.1	1302.8	57.7	36.81
LLaVA-Phi	2.7B	68.4	1335.1	59.8	42.95

¹Average of the SQA, MME, and MMBench scores, with the MME score divided by 2000 before averaging.
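
The Average column can be checked with a short sketch, assuming it is the plain mean of the three scores after the MME score is divided by 2000. The scores are copied from the table; the names `SCORES`, `MME_DIVISOR`, and `average_score` are introduced here for illustration only.

```python
# (SQA, MME, MMBench) scores copied from the evaluation table.
SCORES = {
    "UForm-Gen2-Qwen-500m": (45.5, 880.1, 42.0),
    "MobileVLM v2": (52.1, 1302.8, 57.7),
    "LLaVA-Phi": (68.4, 1335.1, 59.8),
}

# Per the footnote, MME scores are divided by 2000 before averaging.
MME_DIVISOR = 2000.0


def average_score(sqa: float, mme: float, mmbench: float) -> float:
    """Mean of SQA, scaled MME, and MMBench (assumed reading of the footnote)."""
    return (sqa + mme / MME_DIVISOR + mmbench) / 3.0


if __name__ == "__main__":
    for name, scores in SCORES.items():
        print(f"{name}: {average_score(*scores):.4f}")
```

Under this assumption the computed means agree with the table's Average column to within 0.01. Note that dividing by 2000 leaves the MME term below 1, so it contributes very little to the average compared with SQA and MMBench.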

Model created over 1 year ago