martintmv-git / moondream2

small vision language model


Input

• image (file, required): the input image
• prompt (string): the input prompt. Default: "Describe this image"

Output

The image features a large, lush green field with a blue sky and white clouds scattered throughout. The field is expansive, covering a significant portion of the scene. The grass appears to be well-maintained and vibrant, creating a serene and peaceful atmosphere. The field is also dotted with a few yellow flowers, adding a touch of color to the landscape. The combination of the open space, the blue sky, and the white clouds creates a picturesque and inviting scene.

Run time and cost

This model runs on Nvidia L40S GPU hardware. We don't yet have enough runs of this model to provide performance information.

Readme

Cog implementation of moondream2.

Creator's GitHub repo: https://github.com/vikhyat/moondream

Hugging Face: https://huggingface.co/vikhyatk/moondream2

Creator's X / Twitter: https://x.com/vikhyatk

Benchmarks

moondream2 is a 1.86B-parameter model initialized with weights from SigLIP and Phi 1.5.

Model                VQAv2   GQA    TextVQA   TallyQA (simple)   TallyQA (full)
moondream1           74.7    57.9   35.6      -                  -
moondream2 (latest)  76.8    60.6   46.4      79.6               73.3

Usage

Using transformers (recommended)

Install the dependencies:

pip install transformers timm einops

Then load a pinned revision of the model and ask a question about an image:

from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image

model_id = "vikhyatk/moondream2"
revision = "2024-03-13"  # pin to a dated release (see note below)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, revision=revision
)
tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)

# Encode the image once, then ask one or more questions about it.
image = Image.open('<IMAGE_PATH>')
enc_image = model.encode_image(image)
print(model.answer_question(enc_image, "Describe this image.", tokenizer))

The model is updated regularly, so we recommend pinning the model version to a specific release as shown above.
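
To see which pinned revisions exist, one option (a sketch using the huggingface_hub library, assuming the dated releases are published as git tags or branches on the Hub) is:

from huggingface_hub import list_repo_refs  # pip install huggingface_hub

# List the repo's git refs; dated revision names like "2024-03-13" show up here.
refs = list_repo_refs("vikhyatk/moondream2")
for ref in refs.tags + refs.branches:
    print(ref.name)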

To enable Flash Attention on the text model, pass attn_implementation="flash_attention_2" when instantiating the model (this requires torch and a CUDA device):

import torch  # needed for torch.float16

model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, revision=revision,
    torch_dtype=torch.float16, attn_implementation="flash_attention_2"
).to("cuda")

Batch inference is also supported.

answers = model.batch_answer(
    images=[Image.open('<IMAGE_PATH_1>'), Image.open('<IMAGE_PATH_2>')],
    prompts=["Describe this image.", "Are there people in this image?"],
    tokenizer=tokenizer,
)
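
Assuming batch_answer returns one answer string per image/prompt pair, in order, the results can be printed with a simple loop:

for prompt, answer in zip(
    ["Describe this image.", "Are there people in this image?"], answers
):
    print(f"{prompt} -> {answer}")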