cjwbw / uform-gen2-qwen-500m

Pocket-Sized Multimodal AI For Content Understanding and Generation


Run time and cost

This model runs on Nvidia A40 (Large) GPU hardware. Predictions typically complete within 9 seconds.

Readme

Description

UForm-Gen is a small generative vision-language model primarily designed for Image Captioning and Visual Question Answering. The model consists of two parts:

  1. CLIP-like ViT-H/14 (vision encoder)
  2. Qwen1.5-0.5B-Chat (language model)

Evaluation

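| Model                | LLM Size | SQA  | MME    | MMBench | Average¹ |
|----------------------|----------|------|--------|---------|----------|
| UForm-Gen2-Qwen-500m | 0.5B     | 45.5 | 880.1  | 42.0    | 29.31    |
| MobileVLM v2         | 1.4B     | 52.1 | 1302.8 | 57.7    | 36.81    |
| LLaVA-Phi            | 2.7B     | 68.4 | 1335.1 | 59.8    | 42.95    |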

¹MME scores were divided by 2000 before averaging.
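
As a check on the footnote, the Average column can be recomputed from the three benchmark scores. The sketch below is not the authors' script. It assumes the average is the plain mean of SQA, MME / 2000, and MMBench, and that the result is truncated (not rounded) to two decimals, which is what appears to reproduce the listed values. The helper names `average_score` and `truncate` are illustrative.

```python
import math

# Benchmark scores copied from the evaluation table: (SQA, MME, MMBench).
SCORES = {
    "UForm-Gen2-Qwen-500m": (45.5, 880.1, 42.0),
    "MobileVLM v2": (52.1, 1302.8, 57.7),
    "LLaVA-Phi": (68.4, 1335.1, 59.8),
}

MME_DIVISOR = 2000  # divisor stated in the footnote


def average_score(sqa, mme, mmbench):
    """Mean of the three scores, with MME scaled down first (assumed formula)."""
    return (sqa + mme / MME_DIVISOR + mmbench) / 3


def truncate(value, places=2):
    """Drop digits past `places` decimals without rounding (assumption)."""
    factor = 10 ** places
    return math.floor(value * factor) / factor


averages = {name: truncate(average_score(*s)) for name, s in SCORES.items()}

for name, avg in averages.items():
    print(f"{name}: {avg:.2f}")
```

Under these assumptions the script yields 29.31, 36.81, and 42.95. Rounding instead of truncating would give 36.82 and 42.96 for the last two models. Note that MME / 2000 is below 1 for every model, so the average is driven almost entirely by SQA and MMBench.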