martintmv-git/moondream2 | Run with an API on Replicate

Readme

Cog implementation of moondream2.

Creator’s GitHub repo: https://github.com/vikhyat/moondream

Hugging Face: https://huggingface.co/vikhyatk/moondream2

X / Twitter of creator: https://x.com/vikhyatk

Benchmarks

moondream2 is a 1.86B-parameter model initialized with weights from SigLIP and Phi 1.5.

Model	VQAv2	GQA	TextVQA	TallyQA (simple)	TallyQA (full)
moondream1	74.7	57.9	35.6	-	-
moondream2 (latest)	76.8	60.6	46.4	79.6	73.3

Usage

Using transformers (recommended)

pip install transformers timm einops

from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image

model_id = "vikhyatk/moondream2"
revision = "2024-03-13"
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, revision=revision
)
tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)

image = Image.open('<IMAGE_PATH>')
enc_image = model.encode_image(image)
print(model.answer_question(enc_image, "Describe this image.", tokenizer))

The model is updated regularly, so we recommend pinning the model version to a specific release as shown above.

To enable Flash Attention on the text model, pass in attn_implementation="flash_attention_2" when instantiating the model.

import torch

# float16 weights on a CUDA GPU; Flash Attention 2 requires the flash-attn package
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, revision=revision,
    torch_dtype=torch.float16, attn_implementation="flash_attention_2"
).to("cuda")
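Flash Attention 2 depends on the separate flash-attn package, which the pip install line above does not cover. A minimal sketch, assuming you want to fall back to the default attention when that package is missing; the helper name attention_kwargs is hypothetical and is not part of moondream2 or transformers:

```python
import importlib.util


def attention_kwargs() -> dict:
    """Return extra from_pretrained kwargs for the attention backend.

    Hypothetical helper: it only checks that the flash_attn module can be
    found. It does not verify GPU or CUDA compatibility.
    """
    if importlib.util.find_spec("flash_attn") is not None:
        return {"attn_implementation": "flash_attention_2"}
    return {}  # let transformers pick its default attention


extra = attention_kwargs()
print(extra)
```

The returned dict can be unpacked into the call, for example AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True, revision=revision, **extra).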

Batch inference is also supported.

# model and tokenizer are the objects loaded above
answers = model.batch_answer(
    images=[Image.open('<IMAGE_PATH_1>'), Image.open('<IMAGE_PATH_2>')],
    prompts=["Describe this image.", "Are there people in this image?"],
    tokenizer=tokenizer,
)
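For more images than fit in memory at once, the batch call can be wrapped in a small chunking loop. This is a sketch under assumptions, not part of moondream2: the helper names chunked and answer_in_batches and the default batch_size of 4 are hypothetical, and it assumes batch_answer returns one answer per prompt, in order.

```python
from typing import Iterator, List, Sequence, Tuple


def chunked(items: Sequence, size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def answer_in_batches(model, tokenizer, pairs: Sequence[Tuple[object, str]],
                      batch_size: int = 4) -> List[str]:
    """Call model.batch_answer on fixed-size chunks of (image, prompt) pairs.

    Assumes batch_answer returns one answer per prompt, in order.
    """
    answers: List[str] = []
    for chunk in chunked(pairs, batch_size):
        images = [image for image, _ in chunk]
        prompts = [prompt for _, prompt in chunk]
        answers.extend(
            model.batch_answer(images=images, prompts=prompts, tokenizer=tokenizer)
        )
    return answers
```

With a loaded model and tokenizer, answer_in_batches(model, tokenizer, [(image_1, "Describe this image."), (image_2, "Are there people in this image?")]) would return the answers as one flat list.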
