Caption images
These models generate text descriptions and captions from images. They use large multimodal transformers trained on image-text pairs to understand visual concepts.
Key capabilities:
- Image captioning: Produce relevant captions summarizing image contents and context. Useful for indexing images and for accessibility, such as automating alt text.
- Visual question answering: Generate natural language answers to questions about images. Ask questions about your images.
- Text prompt generation: Create prompts matching image style and content. Use images to guide text-to-image generation.
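As a rough illustration of the first two capabilities, the sketch below sends either a captioning request or a question to a vision language model through the Replicate Python client. It is a minimal sketch under assumptions, not a reference implementation: the input field names `image` and `prompt`, the default caption prompt, and the helper names `build_input` and `describe` are all assumptions for illustration, and some models may need a pinned version appended to the identifier. Check the model's input schema before relying on any of this.

```python
from typing import Any, Callable, Dict, Optional

# Model identifier taken from the listing on this page. Assumption: a version
# suffix ("owner/name:version") may be required for some models.
MODEL = "lucataco/moondream2"


def build_input(image_url: str, question: Optional[str] = None) -> Dict[str, Any]:
    """Build a payload for captioning (no question) or VQA (with a question).

    The field names "image" and "prompt" are assumptions, not a confirmed schema.
    """
    prompt = question if question else "Describe this image."
    return {"image": image_url, "prompt": prompt}


def describe(
    image_url: str,
    question: Optional[str] = None,
    run: Optional[Callable[..., Any]] = None,
) -> str:
    """Return the model's text for an image, optionally answering a question."""
    if run is None:
        # Imported lazily so the sketch can be loaded without the client installed.
        import replicate

        run = replicate.run
    output = run(MODEL, input=build_input(image_url, question))
    # Output may be a single string or an iterable of text chunks; handle both.
    if isinstance(output, str):
        return output.strip()
    return "".join(str(chunk) for chunk in output).strip()
```

Calling `describe(url)` asks for a caption, while `describe(url, "How many people are in the photo?")` turns the same call into visual question answering. The `run` parameter exists only so the function can be exercised with a stand-in instead of a live API call.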
Our pick: Moondream 2B
Moondream is an efficient, versatile vision language model. It offers a great balance of capability and cost, and it can produce a detailed caption in a few seconds.
A more powerful model: LLaVa 13B
If you need more depth than Moondream provides, we recommend the LLaVa 13B model. LLaVa can generate full paragraphs describing an image in depth. It also excels at answering questions about images insightfully.
Budget pick: BLIP
If you need to generate a large volume of image captions or answers and don’t require maximum detail or intelligence, BLIP is a great choice. It performs nearly as well as the more advanced but slower BLIP-2, and its speed makes it significantly cheaper per request.
However, BLIP is less capable than Moondream or LLaVa at generating long-form text or exhibiting deeper visual understanding. Stick with our top pick if you need those advanced capabilities.
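The recommendations above can be sketched as a small selection helper. This is only one way to encode the guidance: the function name `pick_model` and its two flags are hypothetical, and the model identifiers are the ones listed on this page.

```python
def pick_model(needs_long_form: bool = False, high_volume: bool = False) -> str:
    """Map rough requirements to one of the three picks described above.

    Hypothetical helper for illustration. Long-form output takes priority
    over volume, because BLIP is described as weaker at long-form text.
    """
    if needs_long_form:
        return "yorickvp/llava-13b"   # more powerful: in-depth paragraphs and Q&A
    if high_volume:
        return "salesforce/blip"      # budget pick: cheaper per request
    return "lucataco/moondream2"      # default pick: balance of capability and cost


if __name__ == "__main__":
    print(pick_model())
    print(pick_model(high_volume=True))
    print(pick_model(needs_long_form=True, high_volume=True))
```

When both flags are set, the sketch prefers the more capable model; a real pipeline might instead weigh per-request cost against the level of detail needed.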
Featured models

lucataco / moondream2
moondream2 is a small vision language model designed to run efficiently on edge devices
Updated 10 months, 2 weeks ago

yorickvp / llava-13b
Visual instruction tuning towards large language and vision models with GPT-4 level capabilities
Updated 10 months, 3 weeks ago

salesforce / blip
Generate image captions
Updated 2 years, 8 months ago
Recommended models

zsxkib / molmo-7b
allenai/Molmo-7B-D-0924. Answers questions about images and captions them
Updated 8 months, 2 weeks ago

fofr / batch-image-captioning
A wrapper model for captioning multiple images using GPT, Claude or Gemini, useful for LoRA training
Updated 9 months, 4 weeks ago

zsxkib / wd-image-tagger
Image tagger fine-tuned on WaifuDiffusion w/ (SwinV2, SwinV2, ConvNext, and ViT)
Updated 1 year ago

daanelson / minigpt-4
A model which generates text in response to an input image and prompt.
Updated 1 year ago

zsxkib / blip-3
Blip 3 / XGen-MM, Answers questions about images ({blip3,xgen-mm}-phi3-mini-base-r-v1)
Updated 1 year ago

zsxkib / uform-gen
🖼️ Super fast 1.5B Image Captioning/VQA Multimodal LLM (Image-to-Text) 🖋️
Updated 1 year, 4 months ago

andreasjansson / blip-2
Answers questions about images
Updated 1 year, 6 months ago

pharmapsychotic / clip-interrogator
The CLIP Interrogator is a prompt engineering tool that combines OpenAI's CLIP and Salesforce's BLIP to optimize text prompts to match a given image. Use the resulting prompts with text-to-image models like Stable Diffusion to create cool art!
Updated 1 year, 9 months ago

nohamoamary / image-captioning-with-visual-attention
datasets: Flickr8k
Updated 2 years, 1 month ago

rmokady / clip_prefix_caption
Simple image captioning model using CLIP and GPT-2
Updated 2 years, 8 months ago

methexis-inc / img2prompt
Get an approximate text prompt, with style, matching an image. (Optimized for stable-diffusion (clip ViT-L/14))
Updated 2 years, 9 months ago

j-min / clip-caption-reward
Fine-grained Image Captioning with CLIP Reward
Updated 3 years ago