cjwbw / uform-gen2-qwen-500m

Pocket-Sized Multimodal AI For Content Understanding and Generation

  • Public
  • 397 runs
  • L40S

Input

image
file (required)

Input image.

string

Question or Instruction.

Default: "Describe the image in three sentences."

integer

Maximum number of tokens to generate.

Default: 256

Output

A white and orange cat stands on its hind legs, reaching for a white teapot on a wooden table in a garden. The teapot is on a white tablecloth, and a basket of red raspberries is nearby. The cat's position and actions create a playful and charming scene.

Run time and cost

Each run of this model costs approximately $0.0079 on Replicate (about 126 runs per $1), though the exact cost varies with your inputs. The model is also open source, so you can run it on your own computer with Docker.

This model runs on NVIDIA L40S GPU hardware. Predictions typically complete within 9 seconds.
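
The relationship between the per-run cost and the runs-per-dollar figure quoted above can be checked with a few lines of arithmetic. This is a minimal sketch: the constant name and the `estimated_cost` helper are hypothetical, and the only number taken from this page is the approximate cost of $0.0079 per run.

```python
import math

# Approximate per-run cost quoted on this page, in USD.
COST_PER_RUN_USD = 0.0079

# Whole runs that fit in one dollar at that price.
runs_per_dollar = math.floor(1 / COST_PER_RUN_USD)


def estimated_cost(num_runs: int, cost_per_run: float = COST_PER_RUN_USD) -> float:
    """Rough cost estimate in USD (hypothetical helper).

    Actual cost varies with the inputs, so treat this as a ballpark figure.
    """
    return num_runs * cost_per_run


if __name__ == "__main__":
    print(f"Runs per dollar: {runs_per_dollar}")
    print(f"Estimated cost of 1000 runs: ${estimated_cost(1000):.2f}")
```

Because the quoted price is itself approximate, an estimate like this is only useful for rough budgeting.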

Readme

Description

UForm-Gen is a small generative vision-language model primarily designed for Image Captioning and Visual Question Answering. The model consists of two parts:

  1. A CLIP-like ViT-H/14 vision encoder
  2. The Qwen1.5-0.5B-Chat language model

Evaluation

Model                 LLM Size  SQA   MME     MMBench  Average¹
UForm-Gen2-Qwen-500m  0.5B      45.5  880.1   42.0     29.31
MobileVLM v2          1.4B      52.1  1302.8  57.7     36.81
LLaVA-Phi             2.7B      68.4  1335.1  59.8     42.95

¹MME scores were divided by 2000 before averaging.
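
The Average column can be reproduced from the other three scores. The sketch below assumes the average is the plain mean of SQA, MME divided by 2000, and MMBench. That formula is inferred from the footnote and the table values and is not stated explicitly on this page. The function name is hypothetical.

```python
def average_score(sqa: float, mme: float, mmbench: float,
                  mme_divisor: float = 2000.0) -> float:
    """Mean of the three benchmark scores, with MME scaled down first.

    Assumption: a plain arithmetic mean, inferred from the table footnote.
    """
    return (sqa + mme / mme_divisor + mmbench) / 3


# (SQA, MME, MMBench, reported Average) copied from the evaluation table.
rows = {
    "UForm-Gen2-Qwen-500m": (45.5, 880.1, 42.0, 29.31),
    "MobileVLM v2": (52.1, 1302.8, 57.7, 36.81),
    "LLaVA-Phi": (68.4, 1335.1, 59.8, 42.95),
}

if __name__ == "__main__":
    for model, (sqa, mme, mmbench, reported) in rows.items():
        computed = average_score(sqa, mme, mmbench)
        print(f"{model}: computed {computed:.4f}, reported {reported}")
```

Under this assumption the computed values agree with the reported ones to within 0.01. The small differences suggest the table values may have been truncated rather than rounded. Note that dividing by 2000 makes MME contribute less than one point to each average, so the column is dominated by SQA and MMBench.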