Chat with images

Vision models process and interpret visual information from images and videos. You can use vision models to answer questions about an image's contents, identify and locate objects, and more.

Here’s an example using the yorickvp/llava-13b vision model to generate recipe ideas from an image of your fridge:

[Image: a photo of the inside of a fridge]

And here’s how you can run the model from your JavaScript code:

import Replicate from "replicate";

// The client reads your API token from the REPLICATE_API_TOKEN environment variable.
const replicate = new Replicate();

const output = await replicate.run(
  "yorickvp/llava-13b:01359160a4cff57c6b7d4dc625d0019d390c7c46f553714069f114b392f4a726",
  {
    input: {
      image: "https://replicate.delivery/pbxt/KZOUXoMy3OxnyOeIA0LNzhtWDjBZLm9T6IPm5lbKcFT8lybo/fridge.png",
      prompt: "Here's a photo of my fridge today. Please give me some simple recipe ideas based on its contents.",
    },
  }
);

// This model returns its output as an array of strings, so join it into one reply.
console.log(output.join(""));
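Because llava-13b generates its reply token by token, you can also print the response incrementally instead of waiting for the full array. A minimal sketch, assuming a recent version of the replicate JavaScript client (which exposes a stream() method):

import Replicate from "replicate";
const replicate = new Replicate();

// Each event is a chunk of generated text; write it out as it arrives.
for await (const event of replicate.stream(
  "yorickvp/llava-13b:01359160a4cff57c6b7d4dc625d0019d390c7c46f553714069f114b392f4a726",
  {
    input: {
      image: "https://replicate.delivery/pbxt/KZOUXoMy3OxnyOeIA0LNzhtWDjBZLm9T6IPm5lbKcFT8lybo/fridge.png",
      prompt: "Here's a photo of my fridge today. Please give me some simple recipe ideas based on its contents.",
    },
  }
)) {
  process.stdout.write(`${event}`);
}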

If you don’t need reasoning abilities and just want descriptions of images, check out our image captioning collection.
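One of those captioning options, lucataco/moondream2, also appears in the list below, and the request shape is the same: swap in the model and a simpler prompt. A rough sketch reusing the fridge image from above; the bare model identifier (no version hash) and the image/prompt input names are assumptions to verify on the model's API page:

import Replicate from "replicate";
const replicate = new Replicate();

// Assumed inputs: `image` (URL) and `prompt` (string); confirm against the model's schema.
const caption = await replicate.run("lucataco/moondream2", {
  input: {
    image: "https://replicate.delivery/pbxt/KZOUXoMy3OxnyOeIA0LNzhtWDjBZLm9T6IPm5lbKcFT8lybo/fridge.png",
    prompt: "Describe this image.",
  },
});
console.log(caption);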

Recommended models

openai / gpt-4.1-mini

Fast, affordable version of GPT-4.1

Updated 1 day, 7 hours ago

1.2M runs

openai / gpt-4o

OpenAI's high-intelligence chat model

Updated 1 week ago

198.8K runs

lucataco / qwen2.5-omni-7b

Qwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner.

Updated 5 months, 1 week ago

13.1K runs

anthropic / claude-3.7-sonnet

The most intelligent Claude model and the first hybrid reasoning model on the market (claude-3-7-sonnet-20250219)

Updated 6 months, 3 weeks ago

2.7M runs

anthropic / claude-3.5-sonnet

Anthropic's most intelligent language model to date, with a 200K token context window and image understanding (claude-3-5-sonnet-20241022)

Updated 7 months ago

523.5K runs

lucataco / qwen2-vl-7b-instruct

Latest model in the Qwen family for chatting with video and image inputs

Updated 8 months, 4 weeks ago

250.6K runs

lucataco / ollama-llama3.2-vision-90b

Ollama Llama 3.2 Vision 90B

Updated 9 months ago

3.4K runs

lucataco / ollama-llama3.2-vision-11b

Ollama Llama 3.2 Vision 11B

Updated 9 months ago

2.7K runs

lucataco / moondream2

moondream2 is a small vision language model designed to run efficiently on edge devices

Updated 1 year, 1 month ago

3.2M runs

yorickvp / llava-v1.6-vicuna-13b

LLaVA v1.6: Large Language and Vision Assistant (Vicuna-13B)

Updated 1 year, 7 months ago

3.7M runs

yorickvp / llava-v1.6-mistral-7b

LLaVA v1.6: Large Language and Vision Assistant (Mistral-7B)

Updated 1 year, 7 months ago

4.9M runs

zsxkib / uform-gen

🖼️ Super fast 1.5B Image Captioning/VQA Multimodal LLM (Image-to-Text) 🖋️

Updated 1 year, 7 months ago

2.3K runs

adirik / kosmos-g

Kosmos-G: Generating Images in Context with Multimodal Large Language Models

Updated 1 year, 9 months ago

4.5K runs

lucataco / bakllava

BakLLaVA-1 is a Mistral 7B base augmented with the LLaVA 1.5 architecture

Updated 1 year, 10 months ago

39.8K runs

lucataco / qwen-vl-chat

A multimodal LLM-based AI assistant trained with alignment techniques. Qwen-VL-Chat supports flexible interaction, such as multi-round question answering and creative tasks.

Updated 1 year, 11 months ago

825.5K runs

adirik / owlvit-base-patch32

Zero-shot / open vocabulary object detection (see the sketch after this list)

Updated 1 year, 11 months ago

24.4K runs
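For the "identify and locate objects" use case mentioned at the top of this page, adirik/owlvit-base-patch32 above follows the same request shape as the other examples. A minimal sketch; the query input name and its comma-separated label format are assumptions to confirm on the model's API page:

import Replicate from "replicate";
const replicate = new Replicate();

// Assumed inputs: `image` (URL) and `query` (comma-separated candidate labels);
// check the model's schema before relying on these names.
const detections = await replicate.run("adirik/owlvit-base-patch32", {
  input: {
    image: "https://replicate.delivery/pbxt/KZOUXoMy3OxnyOeIA0LNzhtWDjBZLm9T6IPm5lbKcFT8lybo/fridge.png",
    query: "milk carton, egg box, lettuce",
  },
});
console.log(detections);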