These models generate text descriptions and captions from images. They use large multimodal transformers trained on image-text pairs to understand visual concepts.
Moondream is an efficient, versatile vision language model. It offers a great balance of intelligence and cost, and it can produce a detailed caption in just seconds.
For most people, we recommend the LLaVa 13B model. LLaVa can generate full paragraphs describing an image in depth, and it excels at giving insightful answers to questions about images.
If you need to generate a large volume of image captions or answers and don't require maximum detail or intelligence, BLIP is a great choice. It performs nearly as well as the more advanced but slower BLIP-2 while costing significantly less per request.
However, BLIP is less capable than Moondream or LLaVa at generating long-form text or exhibiting deeper visual understanding. Stick with our top pick if you need those advanced capabilities.
Featured models

lucataco/moondream2: moondream2 is a small vision language model designed to run efficiently on edge devices
Updated 1 year, 4 months ago
5.8M runs

yorickvp/llava-13b: Visual instruction tuning towards large language and vision models with GPT-4 level capabilities
Updated 1 year, 4 months ago
32.8M runs

salesforce/blip: Generate image captions
Updated 3 years, 2 months ago
169.1M runs
Recommended models
If you need quick captions, lucataco/moondream2 is one of the speedier models in the image-to-text collection. It’s optimized to produce short, relevant descriptions without long processing times, which makes it a good fit for basic descriptions at scale: alt text for accessibility, SEO tags, search indexing, and catalog metadata.
When you need richer, more descriptive outputs, yorickvp/llava-13b is a strong option. It handles both simple captioning and more complex visual question answering (VQA), like identifying actions or objects in a scene. If you need more context or nuance than a concise caption provides, switch to it.
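To run a captioning model from code, here is a minimal sketch using the Replicate Python client. The input names ("image" and "prompt") and the example prompt are assumptions based on common conventions in this collection; check the model's API page for its exact schema.

```python
# pip install replicate; set REPLICATE_API_TOKEN in your environment.
import replicate

# Input names here are assumptions -- confirm them on the model's API page.
output = replicate.run(
    "lucataco/moondream2",
    input={
        "image": open("product-photo.jpg", "rb"),
        "prompt": "Describe this image in one concise sentence.",
    },
)

# Text models may stream output as a sequence of chunks; joining
# handles both a plain string and an iterator of strings.
print("".join(output))
```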
For interactive use cases—like asking “What is this person doing?” or “How many people are here?”—pick a model that supports VQA (visual question answering), such as yorickvp/llava-13b.
These models let you pass both an image and a text question to get a natural language answer.
Depending on the model, you may get anything from a short caption to a longer descriptive paragraph or a direct answer to your question.
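For example, here is a minimal VQA call with the Replicate Python client; the input names ("image" and "prompt") are assumptions, so confirm them on the llava-13b model page:

```python
import replicate

# Visual question answering: pass the image plus a natural-language question.
answer = replicate.run(
    "yorickvp/llava-13b",
    input={
        "image": open("street-scene.jpg", "rb"),
        "prompt": "How many people are in this photo, and what are they doing?",
    },
)

# Output may stream as a sequence of text chunks.
print("".join(answer))
```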
You can package your own image captioning or VQA model with Cog and push it to Replicate. Define your inputs (image and optional question) and outputs (text caption or answer), and set your versioning and sharing settings.
This gives you control over how the model runs and is shared.
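As a sketch under those assumptions, a minimal Cog predictor for a captioning or VQA model might look like this; load_my_model and the answer method are hypothetical placeholders for your own inference code:

```python
# predict.py -- a minimal Cog predictor sketch for a captioning/VQA model.
from cog import BasePredictor, Input, Path


class Predictor(BasePredictor):
    def setup(self):
        # Load model weights once, when the container starts.
        self.model = load_my_model()  # hypothetical helper

    def predict(
        self,
        image: Path = Input(description="Input image"),
        question: str = Input(
            description="Optional question about the image",
            default="Describe this image.",
        ),
    ) -> str:
        # Run inference and return a plain-text caption or answer.
        return self.model.answer(image, question)  # hypothetical API
```

With a cog.yaml describing your dependencies, running cog push builds the container and uploads it to Replicate.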
Many models in the image-to-text collection support commercial use, but licenses vary. Always check the model card for attribution requirements or restrictions before using outputs in production.
More models

lucataco/qwen2-vl-7b-instruct: Latest model in the Qwen family for chatting with video and image models
Updated 11 months, 2 weeks ago
304.8K runs

lucataco/ollama-llama3.2-vision-90b: Ollama Llama 3.2 Vision 90B
Updated 11 months, 2 weeks ago
3.5K runs

lucataco/ollama-llama3.2-vision-11b: Ollama Llama 3.2 Vision 11B
Updated 11 months, 2 weeks ago
3.3K runs

lucataco/smolvlm-instruct: SmolVLM-Instruct by HuggingFaceTB
Updated 1 year ago
8.2K runs

lucataco/llama-3-vision-alpha: Projection module trained to add vision capabilities to Llama 3 using SigLIP
Updated 1 year, 1 month ago
5.9K runs

zsxkib/molmo-7b: allenai/Molmo-7B-D-0924, answers questions about images and generates captions
Updated 1 year, 2 months ago
984.9K runs

zsxkib/idefics3: Idefics3-8B-Llama3, answers questions about images and generates captions
Updated 1 year, 3 months ago
2.6K runs

fofr/deprecated-batch-image-captioning: A wrapper model for captioning multiple images using GPT, Claude or Gemini, useful for LoRA training
Updated 1 year, 3 months ago
1.6K runs

lucataco/florence-2-base: Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks
Updated 1 year, 5 months ago
128.6K runs

lucataco/sdxl-clip-interrogator: CLIP Interrogator for SDXL optimizes text prompts to match a given image
Updated 1 year, 6 months ago
848.7K runs

daanelson/minigpt-4: A model which generates text in response to an input image and prompt.
Updated 1 year, 6 months ago
1.8M runs

zsxkib/blip-3: BLIP-3 / XGen-MM, answers questions about images ({blip3,xgen-mm}-phi3-mini-base-r-v1)
Updated 1 year, 6 months ago
1.3M runs

zsxkib/uform-gen: 🖼️ Super fast 1.5B Image Captioning/VQA Multimodal LLM (Image-to-Text) 🖋️
Updated 1 year, 10 months ago
2.3K runs

andreasjansson/blip-2: Answers questions about images
Updated 2 years ago
31.1M runs

lucataco/fuyu-8b: Fuyu-8B is a multi-modal text and image transformer trained by Adept AI
Updated 2 years, 1 month ago
14.6K runs

lucataco/qwen-vl-chat: A multimodal LLM-based AI assistant, which is trained with alignment techniques. Qwen-VL-Chat supports more flexible interaction, such as multi-round question answering, and creative capabilities.
Updated 2 years, 1 month ago
825.6K runs

pharmapsychotic/clip-interrogator: The CLIP Interrogator is a prompt engineering tool that combines OpenAI's CLIP and Salesforce's BLIP to optimize text prompts to match a given image. Use the resulting prompts with text-to-image models like Stable Diffusion to create cool art!
Updated 2 years, 2 months ago
4.8M runs

nohamoamary/image-captioning-with-visual-attention: datasets: Flickr8k
Updated 2 years, 7 months ago
11.3K runs

rmokady/clip_prefix_caption: Simple image captioning model using CLIP and GPT-2
Updated 3 years, 2 months ago
1.7M runs

methexis-inc/img2prompt: Get an approximate text prompt, with style, matching an image. (Optimized for stable-diffusion (clip ViT-L/14))
Updated 3 years, 3 months ago
2.7M runs

j-min/clip-caption-reward: Fine-grained Image Captioning with CLIP Reward
Updated 3 years, 6 months ago
296.1K runs