Chat with images

Vision models process and interpret visual information from images and videos. You can use them to answer questions about an image's content, detect and locate objects, extract text with optical character recognition (OCR), verify a person's identity from facial features, and more.

Here’s an example using the yorickvp/llava-13b vision model to generate recipe ideas from an image of your fridge:

[Image: a photo of the contents of a fridge]

And here’s how you can run the model from your JavaScript code:

import Replicate from "replicate";

// The client reads your REPLICATE_API_TOKEN environment variable by default
const replicate = new Replicate();

const output = await replicate.run(
  // Model identifier in the form "owner/name:version"
  "yorickvp/llava-13b:01359160a4cff57c6b7d4dc625d0019d390c7c46f553714069f114b392f4a726",
  {
    input: {
      // A publicly accessible image URL
      image: "https://replicate.delivery/pbxt/KZOUXoMy3OxnyOeIA0LNzhtWDjBZLm9T6IPm5lbKcFT8lybo/fridge.png",
      prompt: "Here's a photo of my fridge today. Please give me some simple recipe ideas based on its contents.",
    }
  }
);
console.log(output);
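Depending on the model, the resolved output may be an array of text chunks (llava-13b streams its reply token by token) rather than a single string. A small helper to assemble the final text, sketched here under the assumption that chunked output arrives as an array of strings; check the model's output schema on its Replicate page:

```javascript
// Join a streamed chunk array into one string; pass plain-string
// output through unchanged.
function assembleOutput(output) {
  return Array.isArray(output) ? output.join("") : String(output);
}

// Example with llava-style chunked output:
console.log(assembleOutput(["Based on your fridge, ", "try a frittata."]));
```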

Recommended models

yorickvp / llava-13b

Visual instruction tuning towards large language and vision models with GPT-4 level capabilities

19.3M runs

yorickvp / llava-v1.6-mistral-7b

LLaVA v1.6: Large Language and Vision Assistant (Mistral-7B)

4.8M runs

yorickvp / llava-v1.6-vicuna-13b

LLaVA v1.6: Large Language and Vision Assistant (Vicuna-13B)

3.3M runs

yorickvp / llava-v1.6-34b

LLaVA v1.6: Large Language and Vision Assistant (Nous-Hermes-2-34B)

1.7M runs

daanelson / minigpt-4

A model which generates text in response to an input image and prompt.

1.4M runs

lucataco / qwen-vl-chat

A multimodal LLM-based AI assistant, which is trained with alignment techniques. Qwen-VL-Chat supports more flexible interaction, such as multi-round question answering, and creative capabilities.

791.1K runs

lucataco / moondream2

moondream2 is a small vision language model designed to run efficiently on edge devices

219.3K runs

cjwbw / internlm-xcomposer

Advanced text-image comprehension and composition based on InternLM

164.2K runs

adirik / owlvit-base-patch32

Zero-shot / open vocabulary object detection

22.9K runs

lucataco / moondream1

(Research only) Moondream1 is a vision language model that performs on par with models twice its size

10.4K runs

adirik / kosmos-g

Kosmos-G: Generating Images in Context with Multimodal Large Language Models

4.1K runs

adirik / masactrl-sdxl

Editable image generation with MasaCtrl-SDXL

3.4K runs

zsxkib / uform-gen

🖼️ Super fast 1.5B Image Captioning/VQA Multimodal LLM (Image-to-Text) 🖋️

2.1K runs

cjwbw / unidiffuser

One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale

1.1K runs

cjwbw / unival

Unified Model for Image, Video, Audio and Language Tasks

934 runs
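Every model in this collection is run the same way: pass an `owner/name:version` identifier and a model-specific `input` object to `replicate.run`. A minimal sketch of assembling that identifier (the `modelRef` helper is ours for illustration, not part of the Replicate SDK; version hashes come from each model's page):

```javascript
// Build a Replicate model reference of the form "owner/name:version".
function modelRef(owner, name, version) {
  return `${owner}/${name}:${version}`;
}

// The llava-13b reference used in the example above:
const ref = modelRef(
  "yorickvp",
  "llava-13b",
  "01359160a4cff57c6b7d4dc625d0019d390c7c46f553714069f114b392f4a726"
);
console.log(ref);
```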