Vision models process and interpret visual information from images and videos. You can use them to answer questions about an image's content, identify and locate objects, and more.
For example, you can use the yorickvp/llava-13b vision model to generate recipe ideas from an image of your fridge.
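Here's a minimal sketch of that using the Replicate Python client. The `image` and `prompt` input names match the model's published schema at the time of writing, and the file name is illustrative; check the model page if the schema has changed.

```python
import replicate

# Ask yorickvp/llava-13b for recipe ideas based on a photo of a fridge.
# replicate.run() resolves the model's latest version; you can also pin
# a specific version with "yorickvp/llava-13b:<version-id>".
output = replicate.run(
    "yorickvp/llava-13b",
    input={
        "image": open("fridge.jpg", "rb"),  # placeholder path
        "prompt": "What can I cook with the ingredients in this fridge?",
    },
)

# The model streams its response, so the output is an iterable of text chunks.
print("".join(output))
```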
If you don't need reasoning abilities and just want to get descriptions of images, check out our image captioning collection →
Featured models


openai/gpt-4o-mini
Low latency, low cost version of OpenAI's GPT-4o model
Updated 2 months, 2 weeks ago
3.7M runs

anthropic/claude-4-sonnet
Claude Sonnet 4 is a significant upgrade to 3.7, delivering superior coding and reasoning while responding more precisely to your instructions
Updated 4 months, 2 weeks ago
1M runs


yorickvp/llava-13b
Visual instruction tuning towards large language and vision models with GPT-4 level capabilities
Updated 1 year, 3 months ago
31.9M runs
Recommended models
If you want a low-latency model that can understand and talk about images, lighter options in this collection, such as openai/gpt-4o-mini, are well suited to quick interactions.
Faster models work well for straightforward scenes and short prompts. For more complex reasoning, you may prefer a larger model.
yorickvp/llava-13b is a strong all-around model in the Vision Models collection. It supports both image captioning and visual question answering (VQA), producing more descriptive and context-aware responses.
If you just need a quick caption, smaller models can be faster and cheaper to run.
For interactive use, like asking “What’s happening in this photo?” or “How many people are in the image?”, pick a model that supports combined image and text prompting, such as yorickvp/llava-13b.
These models can interpret your image in context and give natural language answers.
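As a sketch of that kind of interactive use with the Replicate Python client: this assumes yorickvp/llava-13b's `image` and `prompt` inputs, and the file name and questions are illustrative.

```python
import replicate

# Ask several questions about the same image.
questions = [
    "What's happening in this photo?",
    "How many people are in the image?",
]

for question in questions:
    output = replicate.run(
        "yorickvp/llava-13b",
        input={"image": open("photo.jpg", "rb"), "prompt": question},
    )
    # Output streams as text chunks; join them into one answer.
    print(f"{question}\n  {''.join(output)}")
```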
If you just want a brief description (e.g., for accessibility, alt text, or indexing), lighter models like openai/gpt-4o-mini can provide fast, simple captions.
You don’t need a heavyweight model unless your task requires more complex reasoning.
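For alt-text-style captions, a sketch along these lines works with a lighter model. The `prompt` and `image_input` field names here are assumptions about openai/gpt-4o-mini's current schema on Replicate, and the URL is a placeholder; check the model's API page before relying on them.

```python
import replicate

# Generate a short caption suitable for alt text.
output = replicate.run(
    "openai/gpt-4o-mini",
    input={
        "prompt": "Write a one-sentence alt-text caption for this image.",
        # Assumed field name; this model takes image URLs rather than files.
        "image_input": ["https://example.com/photo.jpg"],
    },
)

print("".join(output))
```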
Most models in this collection take an image (and usually an optional text prompt) and return text: a caption, a description, or an answer to your question.
You can package your own multi-modal model with Cog and push it to Replicate.
Define your input schema (e.g., image + optional question) and output (caption or answer), then publish it to share or use commercially.
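As a rough sketch of what that predictor might look like with Cog, the model-loading and inference calls below are placeholders for your own code:

```python
# predict.py — a minimal Cog predictor for an image + optional question model
from cog import BasePredictor, Input, Path


class Predictor(BasePredictor):
    def setup(self) -> None:
        # Load your model weights once, when the container starts, e.g.
        # self.model = MyVisionLanguageModel.from_pretrained("weights/")
        self.model = None  # placeholder

    def predict(
        self,
        image: Path = Input(description="Input image"),
        question: str = Input(
            description="Optional question about the image; leave blank for a caption",
            default="",
        ),
    ) -> str:
        # Run your model here and return a caption or an answer, e.g.
        # return self.model.generate(image, question or "Describe this image.")
        return "placeholder output"
```

Pair it with a cog.yaml that declares your Python version and dependencies, then use `cog push` to publish it to Replicate.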
Many models in the Vision Models collection allow commercial use, but licenses vary. Always check the model card for attribution requirements or restrictions before using outputs commercially.
Recommended models


openai/gpt-4.1-mini
Fast, affordable version of GPT-4.1
Updated 1 month, 2 weeks ago
1.3M runs


openai/gpt-4o
OpenAI's high-intelligence chat model
Updated 1 month, 3 weeks ago
257.9K runs


lucataco/qwen2.5-omni-7b
Qwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner.
Updated 6 months, 3 weeks ago
19.5K runs

anthropic/claude-3.7-sonnet
The most intelligent Claude model and the first hybrid reasoning model on the market (claude-3-7-sonnet-20250219)
Updated 8 months ago
3M runs

anthropic/claude-3.5-sonnet
Anthropic's most intelligent language model to date, with a 200K token context window and image understanding (claude-3-5-sonnet-20241022)
Updated 8 months, 2 weeks ago
581.9K runs


lucataco/qwen2-vl-7b-instruct
Latest model in the Qwen family for chatting about videos and images
Updated 10 months, 1 week ago
261.9K runs


lucataco/ollama-llama3.2-vision-90b
Ollama Llama 3.2 Vision 90B
Updated 10 months, 2 weeks ago
3.4K runs


lucataco/ollama-llama3.2-vision-11b
Ollama Llama 3.2 Vision 11B
Updated 10 months, 2 weeks ago
2.8K runs


lucataco/moondream2
moondream2 is a small vision language model designed to run efficiently on edge devices
Updated 1 year, 3 months ago
4.7M runs


daanelson/minigpt-4
A model which generates text in response to an input image and prompt.
Updated 1 year, 5 months ago
1.8M runs


yorickvp/llava-v1.6-vicuna-13b
LLaVA v1.6: Large Language and Vision Assistant (Vicuna-13B)
Updated 1 year, 8 months ago
3.7M runs


yorickvp/llava-v1.6-mistral-7b
LLaVA v1.6: Large Language and Vision Assistant (Mistral-7B)
Updated 1 year, 8 months ago
4.9M runs


zsxkib/uform-gen
🖼️ Super fast 1.5B Image Captioning/VQA Multimodal LLM (Image-to-Text) 🖋️
Updated 1 year, 8 months ago
2.3K runs


adirik/kosmos-g
Kosmos-G: Generating Images in Context with Multimodal Large Language Models
Updated 1 year, 10 months ago
4.5K runs


cjwbw/cogvlm
powerful open-source visual language model
Updated 1 year, 11 months ago
1.5M runs


lucataco/bakllava
BakLLaVA-1 is a Mistral 7B base augmented with the LLaVA 1.5 architecture
Updated 2 years ago
39.8K runs


lucataco/qwen-vl-chat
A multimodal LLM-based AI assistant, which is trained with alignment techniques. Qwen-VL-Chat supports more flexible interaction, such as multi-round question answering, and creative capabilities.
Updated 2 years ago
825.6K runs


adirik/owlvit-base-patch32
Zero-shot / open vocabulary object detection
Updated 2 years ago
24.4K runs


cjwbw/internlm-xcomposer
Advanced text-image comprehension and composition based on InternLM
Updated 2 years ago
164.4K runs