Collections

Caption Images

These models generate text descriptions and captions from images. They use large multimodal transformers trained on image-text pairs to understand visual concepts.

Key capabilities:

  • Image captioning: Produce relevant captions that summarize an image's contents and context. Useful for indexing images, accessibility, and automating alt text.
  • Visual question answering: Generate natural language answers to questions about your images.
  • Text prompt generation: Create prompts matching image style and content. Use images to guide text-to-image generation.

Our pick: Moondream 2B

Moondream is an efficient, versatile vision language model. It offers a strong balance of intelligence and cost, and it can return a detailed caption in just seconds.

A more powerful model: LLaVa 13B

If you want richer, more detailed output, step up to LLaVa 13B. LLaVa can generate full paragraphs describing an image in depth, and it excels at answering questions about images insightfully.

Budget pick: BLIP

If you need to generate a large volume of image captions or answers and don't require maximum detail or intelligence, BLIP is a great choice. It performs nearly as well as the more advanced but slower BLIP-2, and its speed makes it significantly cheaper per request.

However, BLIP is less capable than Moondream or LLaVa at generating long-form text or exhibiting deeper visual understanding. Stick with Moondream or LLaVa if you need those advanced capabilities.

Frequently asked questions

Which models are the fastest for generating text from images?

If you need quick captions, lucataco/moondream2 is one of the speedier models in the image-to-text collection. It’s optimized to produce short, relevant descriptions without long processing times.
Faster captioning models are best when you need basic descriptions or alt text at scale.

Which models provide the best balance of detail and usability?

yorickvp/llava-13b is a strong option when you need richer, more descriptive outputs. It can handle both simple captioning and more complex visual question answering (VQA), like identifying actions or objects in a scene.
If your goal is accessibility, search indexing, or descriptive tags, lucataco/moondream2 gives you good coverage without long waits.

What works best for accessibility, alt text, or tagging images?

For straightforward image descriptions—like alt text, SEO tags, or catalog metadata—lucataco/moondream2 is a great fit. It generates clear, concise captions that describe what’s in an image.
If you need more context or nuance in those descriptions, switch to a more expressive model like yorickvp/llava-13b.

What should I use if I want to ask questions about an image?

For interactive use cases—like asking “What is this person doing?” or “How many people are here?”—pick a model that supports VQA (visual question answering), such as yorickvp/llava-13b.
These models let you pass both an image and a text question to get a natural language answer.
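As a rough sketch, here's how that might look with Replicate's Python client. The input names (image, prompt) and the streamed output shown here are assumptions based on how yorickvp/llava-13b is typically called; check the model's API tab for its exact schema.

    import replicate

    # Ask a question about an image with a VQA-capable model.
    # Input names ("image", "prompt") are assumptions; check the model's API tab.
    output = replicate.run(
        "yorickvp/llava-13b",
        input={
            "image": open("crowd.jpg", "rb"),
            "prompt": "How many people are in this photo, and what are they doing?",
        },
    )

    # Language-model-style outputs are often streamed as text chunks; join them.
    answer = output if isinstance(output, str) else "".join(output)
    print(answer)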

How do the main types of models in this collection differ?

  • Basic captioning: Fast and simple. Generates a one-line description or tags for an image.
  • Captioning + VQA: Accepts both an image and an optional question, returning more detailed answers.
  • Lightweight vs richer models: Lightweight captioning models are good for scale and speed; larger models like yorickvp/llava-13b provide more detailed reasoning about image content.

What kinds of outputs can I expect from image-to-text models?

Depending on the model, you may get:

  • A short text caption summarizing the image.
  • A set of tags or keywords describing objects or concepts.
  • A full-sentence answer to a question about the image.

Some models give brief outputs, while others can generate paragraph-length text.

How can I self-host or publish my own image-to-text model?

You can package your own image captioning or VQA model with Cog and push it to Replicate. Define your inputs (image and optional question) and outputs (text caption or answer), and set your versioning and sharing settings.
This gives you control over how the model runs and is shared.
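As a minimal sketch, a Cog predictor for a captioning model might look like the following. It assumes you're wrapping the Hugging Face transformers BLIP checkpoint; your own model's loading and inference code will differ.

    # predict.py: a minimal Cog predictor sketch that wraps BLIP for captioning.
    from cog import BasePredictor, Input, Path
    from PIL import Image
    from transformers import BlipProcessor, BlipForConditionalGeneration

    class Predictor(BasePredictor):
        def setup(self):
            # Load model weights once, when the container starts.
            self.processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
            self.model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

        def predict(self, image: Path = Input(description="Input image")) -> str:
            # Generate a short caption for the input image.
            raw_image = Image.open(str(image)).convert("RGB")
            inputs = self.processor(raw_image, return_tensors="pt")
            out = self.model.generate(**inputs, max_new_tokens=40)
            return self.processor.decode(out[0], skip_special_tokens=True)

From there, a cog.yaml file declares your dependencies, and cog push publishes the packaged model to Replicate.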

Can I use these models for commercial work?

Many models in the image-to-text collection support commercial use, but licenses vary. Always check the model card for attribution requirements or restrictions before using outputs in production.

How do I use image-to-text models on Replicate?

  1. Pick a model from the image-to-text collection.
  2. Upload an image or provide a URL.
  3. (Optional) Add a text question if the model supports VQA.
  4. Run the model and get your caption or answer.
  5. Use the text output in your product, accessibility layers, or workflows.
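In code, those steps might look like this with Replicate's Python client. This is a minimal sketch: the input names and the example image URL are assumptions, so check your chosen model's API tab for its exact schema.

    import replicate

    # 1. Pick a model from the collection (you may need to pin a version hash
    #    from the model page, e.g. "lucataco/moondream2:<version>").
    # 2. Provide an image as a URL or a local file handle.
    # 3. Optionally include a question if the model supports VQA.
    output = replicate.run(
        "lucataco/moondream2",
        input={
            "image": "https://example.com/photo.jpg",  # assumed input name; or open("photo.jpg", "rb")
            "prompt": "Describe this image.",          # assumed input name
        },
    )

    # 4. Collect the caption or answer (some models stream text chunks).
    caption = output if isinstance(output, str) else "".join(output)

    # 5. Use the text in your product, accessibility layer, or workflow.
    print(caption)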

What should I keep in mind when working with image-to-text models?

  • Clear, high-quality images lead to better captions.
  • Use fast models for bulk captioning and richer models when context matters.
  • Be specific when asking questions for VQA.
  • If consistency matters (e.g., for SEO or datasets), test and review outputs before scaling.
  • Outputs are text only—if you need audio, you can pair these captions with a text-to-speech model.
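For instance, chaining the last tip above might look like this rough sketch. The "owner/some-tts-model" identifier and its "text" input are placeholders, not a real model; substitute any text-to-speech model from Replicate and its documented input names.

    import replicate

    # Caption the image first (input names are assumptions; check the model's API tab).
    caption = replicate.run(
        "lucataco/moondream2",
        input={"image": open("photo.jpg", "rb"), "prompt": "Describe this image."},
    )
    caption_text = caption if isinstance(caption, str) else "".join(caption)

    # Then speak the caption. "owner/some-tts-model" and its "text" input are
    # placeholders: pick a real text-to-speech model and use its documented inputs.
    audio = replicate.run("owner/some-tts-model", input={"text": caption_text})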