These models generate text descriptions and captions from images. They are large multimodal transformers trained on image-text pairs, which lets them recognize visual concepts and describe them in natural language.
Which model you should choose depends on your priorities:
Moondream is an efficient, versatile vision language model. It offers a strong balance of capability to cost, and it can produce a detailed caption in just seconds.
For most people, we recommend the LLaVA 13B model. LLaVA can generate full paragraphs describing an image in depth, and it excels at giving insightful answers to questions about images.
If you need to generate a large volume of image captions or answers and don't require maximum detail or intelligence, BLIP is a great choice. It performs nearly as well as the more advanced but slower BLIP-2 while costing significantly less per request.
However, BLIP is less capable than Moondream or LLaVA at generating long-form text or demonstrating deeper visual understanding. Stick with our top pick, LLaVA, if you need those advanced capabilities.
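As a minimal sketch of what running one of these models looks like, here is an example using Replicate's Python client to caption an image with LLaVA 13B. It assumes a REPLICATE_API_TOKEN environment variable is set, and the input field names ("image", "prompt") and the streamed-text output shape are assumptions based on the model's published schema; check the model page for the exact inputs.

import replicate

# Open the image as a file handle; the client uploads it for you.
with open("photo.jpg", "rb") as image_file:
    output = replicate.run(
        "yorickvp/llava-13b",
        input={
            "image": image_file,  # assumed input field name
            "prompt": "Describe this image in detail.",
        },
    )

# llava-13b streams its answer as chunks of text (assumed output shape);
# join the chunks to get the full caption.
print("".join(output))

To try Moondream or BLIP instead, swap in that model's reference (for example, lucataco/moondream2 or salesforce/blip) and adjust the input fields to match its schema.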
Featured models


lucataco/moondream2
moondream2 is a small vision language model designed to run efficiently on edge devices
Updated 1 year, 2 months ago
4.4M runs


yorickvp/llava-13b
Visual instruction tuning towards large language and vision models with GPT-4 level capabilities
Updated 1 year, 3 months ago
31.7M runs


salesforce/blip
Generate image captions
Updated 3 years ago
168.5M runs
Recommended models


lucataco/qwen2-vl-7b-instruct
Latest model in the Qwen family for chatting about images and video
Updated 10 months ago
260.4K runs


lucataco/ollama-llama3.2-vision-90b
Ollama Llama 3.2 Vision 90B
Updated 10 months, 1 week ago
3.4K runs


lucataco/ollama-llama3.2-vision-11b
Ollama Llama 3.2 Vision 11B
Updated 10 months, 1 week ago
2.7K runs


lucataco/smolvlm-instruct
SmolVLM-Instruct by HuggingFaceTB
Updated 10 months, 3 weeks ago
3.3K runs


lucataco/llama-3-vision-alpha
Projection module trained to add vision capabilities to Llama 3 using SigLIP
Updated 11 months, 2 weeks ago
5.8K runs


zsxkib/molmo-7b
allenai/Molmo-7B-D-0924: answers questions about images and generates captions
Updated 1 year ago
579.8K runs


zsxkib/idefics3
Idefics3-8B-Llama3: answers questions about images and generates captions
Updated 1 year, 2 months ago
2.5K runs


fofr/deprecated-batch-image-captioning
A wrapper model for captioning multiple images using GPT, Claude, or Gemini; useful for LoRA training
Updated 1 year, 2 months ago
1.5K runs


lucataco/florence-2-base
Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks
Updated 1 year, 3 months ago
125.3K runs


lucataco/sdxl-clip-interrogator
CLIP Interrogator for SDXL optimizes text prompts to match a given image
Updated 1 year, 5 months ago
848.6K runs


daanelson/minigpt-4
A model that generates text in response to an input image and prompt
Updated 1 year, 5 months ago
1.8M runs


zsxkib/blip-3
BLIP-3 / XGen-MM: answers questions about images ({blip3,xgen-mm}-phi3-mini-base-r-v1)
Updated 1 year, 5 months ago
1.3M runs


zsxkib/uform-gen
🖼️ Super fast 1.5B Image Captioning/VQA Multimodal LLM (Image-to-Text) 🖋️
Updated 1 year, 8 months ago
2.3K runs


andreasjansson/blip-2
Answers questions about images
Updated 1 year, 11 months ago
30.9M runs


lucataco/fuyu-8b
Fuyu-8B is a multi-modal text and image transformer trained by Adept AI
Updated 2 years ago
4.6K runs


lucataco/qwen-vl-chat
A multimodal LLM-based AI assistant trained with alignment techniques. Qwen-VL-Chat supports flexible interaction, such as multi-turn question answering and creative tasks.
Updated 2 years ago
825.5K runs


pharmapsychotic/clip-interrogator
The CLIP Interrogator is a prompt engineering tool that combines OpenAI's CLIP and Salesforce's BLIP to optimize text prompts to match a given image. Use the resulting prompts with text-to-image models like Stable Diffusion to create cool art!
Updated 2 years, 1 month ago
4.5M runs


nohamoamary/image-captioning-with-visual-attention
Image captioning with visual attention, trained on the Flickr8k dataset
Updated 2 years, 5 months ago
11.3K runs


rmokady/clip_prefix_caption
Simple image captioning model using CLIP and GPT-2
Updated 3 years ago
1.7M runs


methexis-inc/img2prompt
Get an approximate text prompt, with style, matching an image. (Optimized for Stable Diffusion (CLIP ViT-L/14))
Updated 3 years, 2 months ago
2.7M runs


j-min/clip-caption-reward
Fine-grained Image Captioning with CLIP Reward
Updated 3 years, 4 months ago
296.1K runs