Vision models process and interpret visual information from images and videos. You can use vision models to answer questions about the content of an image, identify and locate objects, and more.
For example, you can use the yorickvp/llava-13b vision model to generate recipe ideas from an image of your fridge.
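Here is a minimal sketch of that using the Replicate Python client. The image URL is a placeholder, and the "image" and "prompt" input names follow the model's schema at the time of writing, so confirm them on the model's API tab before running it:

```python
import replicate  # pip install replicate; requires REPLICATE_API_TOKEN in your environment

# Ask llava-13b for recipe ideas based on a photo of your fridge.
# The image URL is a placeholder; the "image" and "prompt" input names
# are assumed from the model's schema (check the model's API tab).
output = replicate.run(
    "yorickvp/llava-13b",
    input={
        "image": "https://example.com/my-fridge.jpg",
        "prompt": "What can I cook with the ingredients in this fridge?",
    },
)

# The model streams its answer back in chunks of text; join them into one string.
print("".join(output))
```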
If you don't need reasoning abilities and just want to get descriptions of images, check out our image captioning collection →
Featured models

Google’s hybrid “thinking” AI model optimized for speed and cost-efficiency
Updated 2 weeks, 2 days ago
277.9K runs

openai/gpt-4o-mini
Low latency, low cost version of OpenAI's GPT-4o model
Updated 3 months, 4 weeks ago
7.3M runs

Claude Sonnet 4 is a significant upgrade to 3.7, delivering superior coding and reasoning while responding more precisely to your instructions
Updated 5 months, 4 weeks ago
1.2M runs
Recommended Models
If you want a low-latency model that can understand and talk about images, lighter options in the Vision Models collection—such as openai/gpt-4o-mini—are well-suited for quick interactions.
Faster models work well for straightforward scenes and short prompts. For more complex reasoning, you may prefer a larger model.
yorickvp/llava-13b is a strong all-around model in the Vision Models collection. It supports both image captioning and visual question answering (VQA), producing more descriptive and context-aware responses.
If you just need a quick caption, smaller models can be faster and cheaper to run.
For interactive use—like asking “What’s happening in this photo?” or “How many people are in the image?”—pick a model that supports image + text prompting, such as yorickvp/llava-13b.
These models can interpret your image in context and give natural language answers.
If you just want a brief description (e.g., for accessibility, alt text, or indexing), lighter models like openai/gpt-4o-mini can provide fast, simple captions.
You don’t need a heavyweight model unless your task requires more complex reasoning.
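As a rough sketch of that captioning use case with openai/gpt-4o-mini: the "prompt" and "image_input" input names are assumptions that may not match the model's current schema, so check its API tab, and the image URL is a placeholder.

```python
import replicate

# Generate one-sentence alt text for an image with a lightweight model.
# The "prompt" and "image_input" input names are assumptions; confirm them
# on the model's API tab. The image URL is a placeholder.
output = replicate.run(
    "openai/gpt-4o-mini",
    input={
        "prompt": "Write one concise sentence of alt text for this image.",
        "image_input": ["https://example.com/photo.jpg"],
    },
)

print("".join(output))
```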
Most models in this collection return natural-language text: a caption or description of the image, or an answer to the question you asked about it.
You can package your own multi-modal model with Cog and push it to Replicate.
Define your input schema (e.g., image + optional question) and output (caption or answer), then publish it to share or use commercially.
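A minimal predictor sketch is shown below. The Cog BasePredictor, Input, and Path APIs are real; load_model() and .answer() are placeholders for your own model code.

```python
# predict.py: minimal Cog predictor sketch for an image + optional question model.
# BasePredictor, Input, and Path come from Cog; load_model() and .answer() are
# placeholders standing in for your own model code.
from cog import BasePredictor, Input, Path


class Predictor(BasePredictor):
    def setup(self) -> None:
        # Load weights once, when the container starts.
        self.model = load_model("./weights")  # placeholder

    def predict(
        self,
        image: Path = Input(description="Input image"),
        question: str = Input(
            description="Question about the image",
            default="Describe this image.",
        ),
    ) -> str:
        # Return a caption or an answer as plain text.
        return self.model.answer(image, question)  # placeholder
```

Once it works locally with cog predict, you can publish it with cog push r8.im/<your-username>/<your-model-name>.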
Many models in the Vision Models collection allow commercial use, but licenses vary. Always check the model card for attribution requirements or restrictions before using outputs commercially.
Recommended Models

openai/gpt-4.1-mini
Fast, affordable version of GPT-4.1
Updated 2 months, 3 weeks ago
1.3M runs

openai/gpt-4o
OpenAI's high-intelligence chat model
Updated 3 months ago
322.1K runs

lucataco/qwen2.5-omni-7b
Qwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner.
Updated 8 months, 1 week ago
30.9K runs

The most intelligent Claude model and the first hybrid reasoning model on the market (claude-3-7-sonnet-20250219)
Updated 9 months, 2 weeks ago
3.4M runs

Anthropic's most intelligent language model to date, with a 200K token context window and image understanding (claude-3-5-sonnet-20241022)
Updated 9 months, 4 weeks ago
597.8K runs

lucataco/qwen2-vl-7b-instruct
Latest model in the Qwen family for chatting with video and image models
Updated 11 months, 3 weeks ago
315K runs

lucataco/ollama-llama3.2-vision-90b
Ollama Llama 3.2 Vision 90B
Updated 11 months, 3 weeks ago
3.5K runs

lucataco/ollama-llama3.2-vision-11b
Ollama Llama 3.2 Vision 11B
Updated 11 months, 3 weeks ago
3.3K runs

lucataco/moondream2
moondream2 is a small vision language model designed to run efficiently on edge devices
Updated 1 year, 4 months ago
5.9M runs

yorickvp/llava-13b
Visual instruction tuning towards large language and vision models with GPT-4 level capabilities
Updated 1 year, 4 months ago
32.9M runs

daanelson/minigpt-4
A model which generates text in response to an input image and prompt.
Updated 1 year, 6 months ago
1.8M runs

yorickvp/llava-v1.6-vicuna-13b
LLaVA v1.6: Large Language and Vision Assistant (Vicuna-13B)
Updated 1 year, 10 months ago
3.7M runs

yorickvp/llava-v1.6-mistral-7b
LLaVA v1.6: Large Language and Vision Assistant (Mistral-7B)
Updated 1 year, 10 months ago
4.9M runs

zsxkib/uform-gen
🖼️ Super fast 1.5B Image Captioning/VQA Multimodal LLM (Image-to-Text) 🖋️
Updated 1 year, 10 months ago
2.3K runs

adirik/kosmos-g
Kosmos-G: Generating Images in Context with Multimodal Large Language Models
Updated 2 years ago
4.5K runs

cjwbw/cogvlm
Powerful open-source visual language model
Updated 2 years ago
1.5M runs

lucataco/bakllava
BakLLaVA-1 is a Mistral 7B base augmented with the LLaVA 1.5 architecture
Updated 2 years, 1 month ago
39.8K runs

lucataco/qwen-vl-chat
A multimodal LLM-based AI assistant, which is trained with alignment techniques. Qwen-VL-Chat supports more flexible interaction, such as multi-round question answering, and creative capabilities.
Updated 2 years, 1 month ago
825.6K runs

adirik/owlvit-base-patch32
Zero-shot / open vocabulary object detection
Updated 2 years, 1 month ago
24.5K runs

cjwbw/internlm-xcomposer
Advanced text-image comprehension and composition based on InternLM
Updated 2 years, 2 months ago
164.4K runs