Vision models process and interpret visual information from images and videos. You can use them to answer questions about the content of an image, identify and locate objects, or generate descriptions.
Here's an example using the yorickvp/llava-13b vision model to generate recipe ideas from an image of your fridge:
<a href="https://replicate.com/p/c4jewm3bmyqz4og3y2itvrvc5u"> <img alt="fridge" src="https://github.com/replicate/cog/assets/2289/55bd8de4-43cf-4a16-ad87-d2bb2ea5e42f"> </a>

And here's how you can run the model from your JavaScript code:
import Replicate from "replicate";

// By default, the client reads your API token from the REPLICATE_API_TOKEN environment variable.
const replicate = new Replicate();

// Run a pinned version of the model with an image URL and a text prompt.
const output = await replicate.run(
  "yorickvp/llava-13b:01359160a4cff57c6b7d4dc625d0019d390c7c46f553714069f114b392f4a726",
  {
    input: {
      image: "https://replicate.delivery/pbxt/KZOUXoMy3OxnyOeIA0LNzhtWDjBZLm9T6IPm5lbKcFT8lybo/fridge.png",
      prompt: "Here's a photo of my fridge today. Please give me some simple recipe ideas based on its contents.",
    },
  }
);
console.log(output);
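The shape of `output` depends on the model. As a minimal sketch, assuming the model returns its text either as a single string or as an array of string chunks, a small helper can turn it into one string before you display it. `outputToText` is a hypothetical name, not part of the Replicate client, and the sample chunks below are stand-in values rather than real model output.

```javascript
// Hypothetical helper (not part of the Replicate client).
// Assumption: the model output is a string or an array of string chunks.
function outputToText(output) {
  if (typeof output === "string") {
    return output;
  }
  if (Array.isArray(output)) {
    // Concatenate the chunks in the order they were produced.
    return output.join("");
  }
  throw new TypeError("Unexpected output type: " + typeof output);
}

// Stand-in chunks for illustration only; no API call is made here.
const sampleOutput = ["You could make ", "an omelette", "."];
console.log(outputToText(sampleOutput)); // prints "You could make an omelette."
```

If a model returns something else, such as an object or a list of detections, the helper throws rather than guessing, so check the model's output schema before relying on it.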
If you don't need reasoning abilities and just want descriptions of images, check out our image captioning collection.
Featured models
openai/gpt-4o-mini
Low latency, low cost version of OpenAI's GPT-4o model
Updated 1 month, 3 weeks ago
3.3M runs
anthropic/claude-4-sonnet
Claude Sonnet 4 is a significant upgrade to 3.7, delivering superior coding and reasoning while responding more precisely to your instructions
Updated 3 months, 3 weeks ago
902K runs
yorickvp/llava-13b
Visual instruction tuning towards large language and vision models with GPT-4 level capabilities
Updated 1 year, 2 months ago
31.3M runs
Recommended Models
openai/gpt-4.1-mini
Fast, affordable version of GPT-4.1
Updated 3 weeks, 1 day ago
1.3M runs
openai/gpt-4o
OpenAI's high-intelligence chat model
Updated 4 weeks ago
233.7K runs
lucataco/qwen2.5-omni-7b
Qwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner.
Updated 6 months ago
13.8K runs
anthropic/claude-3.7-sonnet
The most intelligent Claude model and the first hybrid reasoning model on the market (claude-3-7-sonnet-20250219)
Updated 7 months, 1 week ago
2.9M runs
anthropic/claude-3.5-sonnet
Anthropic's most intelligent language model to date, with a 200K token context window and image understanding (claude-3-5-sonnet-20241022)
Updated 7 months, 3 weeks ago
532.5K runs
lucataco/qwen2-vl-7b-instruct
Latest model in the Qwen family for chatting about videos and images
Updated 9 months, 2 weeks ago
256.7K runs
lucataco/ollama-llama3.2-vision-90b
Ollama Llama 3.2 Vision 90B
Updated 9 months, 3 weeks ago
3.4K runs
lucataco/ollama-llama3.2-vision-11b
Ollama Llama 3.2 Vision 11B
Updated 9 months, 3 weeks ago
2.7K runs
lucataco/moondream2
moondream2 is a small vision language model designed to run efficiently on edge devices
Updated 1 year, 2 months ago
3.9M runs
daanelson/minigpt-4
A model which generates text in response to an input image and prompt.
Updated 1 year, 4 months ago
1.8M runs
yorickvp/llava-v1.6-vicuna-13b
LLaVA v1.6: Large Language and Vision Assistant (Vicuna-13B)
Updated 1 year, 8 months ago
3.7M runs
yorickvp/llava-v1.6-mistral-7b
LLaVA v1.6: Large Language and Vision Assistant (Mistral-7B)
Updated 1 year, 8 months ago
4.9M runs
zsxkib/uform-gen
🖼️ Super fast 1.5B Image Captioning/VQA Multimodal LLM (Image-to-Text) 🖋️
Updated 1 year, 8 months ago
2.3K runs
adirik/kosmos-g
Kosmos-G: Generating Images in Context with Multimodal Large Language Models
Updated 1 year, 10 months ago
4.5K runs
cjwbw/cogvlm
Powerful open-source visual language model
Updated 1 year, 10 months ago
1.5M runs
lucataco/bakllava
BakLLaVA-1 is a Mistral 7B base augmented with the LLaVA 1.5 architecture
Updated 1 year, 11 months ago
39.8K runs
lucataco/qwen-vl-chat
A multimodal LLM-based AI assistant, which is trained with alignment techniques. Qwen-VL-Chat supports more flexible interaction, such as multi-round question answering, and creative capabilities.
Updated 1 year, 11 months ago
825.5K runs
adirik/owlvit-base-patch32
Zero-shot / open vocabulary object detection
Updated 1 year, 11 months ago
24.4K runs
cjwbw/internlm-xcomposer
Advanced text-image comprehension and composition based on InternLM
Updated 2 years ago
164.4K runs