Chat with images
Vision models process and interpret visual information from images and videos. You can use vision models to answer questions about the content of an image, detect and locate objects, extract text using optical character recognition (OCR), verify a person's identity from facial features, and more.
Here's an example that uses the yorickvp/llava-13b vision model to generate recipe ideas from a photo of your fridge. You can run the model from your JavaScript code like this:
```javascript
import Replicate from "replicate";

const replicate = new Replicate();

const output = await replicate.run(
  "yorickvp/llava-13b:01359160a4cff57c6b7d4dc625d0019d390c7c46f553714069f114b392f4a726",
  {
    input: {
      image: "https://replicate.delivery/pbxt/KZOUXoMy3OxnyOeIA0LNzhtWDjBZLm9T6IPm5lbKcFT8lybo/fridge.png",
      prompt: "Here's a photo of my fridge today. Please give me some simple recipe ideas based on its contents.",
    },
  }
);

console.log(output);
```
Recommended models
![](https://tjzk.replicate.delivery/models_models_featured_image/454548d6-4978-4d85-bca3-d067dfc031bf/llava.png)
yorickvp/llava-13b
Visual instruction tuning towards large language and vision models with GPT-4 level capabilities
![](https://tjzk.replicate.delivery/models_models_cover_image/341d1fb8-9d72-4d9b-9fc7-b1a29ad85bcd/db72a8f8-759b-48db-8f18-316cd632.webp)
yorickvp/llava-v1.6-vicuna-13b
LLaVA v1.6: Large Language and Vision Assistant (Vicuna-13B)
![](https://tjzk.replicate.delivery/models_models_cover_image/6bc7974c-7209-4877-98f5-23e77ef1c6da/fa58799b-aa47-4117-bf1f-25149e2d.webp)
yorickvp/llava-v1.6-mistral-7b
LLaVA v1.6: Large Language and Vision Assistant (Mistral-7B)
![](https://tjzk.replicate.delivery/models_models_cover_image/c6163ba0-edfc-4b53-9a23-eab7fd08b28a/b14df1cd-2e49-4e6b-b965-0deea7c1.webp)
yorickvp/llava-v1.6-34b
LLaVA v1.6: Large Language and Vision Assistant (Nous-Hermes-2-34B)
![](https://tjzk.replicate.delivery/models_models_cover_image/af717919-83de-46e8-9b1a-9c66f4f747bf/out_0.png)
daanelson/minigpt-4
A model that generates text in response to an input image and prompt.
![](https://tjzk.replicate.delivery/models_models_cover_image/7cd09060-b91e-4261-a03e-ed772aa2e044/qwen.jpg)
lucataco/qwen-vl-chat
A multimodal LLM-based AI assistant, which is trained with alignment techniques. Qwen-VL-Chat supports more flexible interaction, such as multi-round question answering, and creative capabilities.
![](https://replicate.delivery/pbxt/JcqDxAZJWep7WsZdWM0gc6Ead2ie0YDEXyemc9HXogSdpsOM/out-0%20(1).png)
cjwbw/internlm-xcomposer
Advanced text-image comprehension and composition based on InternLM
![](https://tjzk.replicate.delivery/models_models_cover_image/3cbb4e68-08b8-4e82-8e83-3300f877dd0f/moondream2.png)
lucataco/moondream2
moondream2 is a small vision language model designed to run efficiently on edge devices
![](https://tjzk.replicate.delivery/models_models_cover_image/e8f484a0-8859-4fe3-b3a7-6f77c6f5e658/mplug-owl-logo.png)
joehoover/mplug-owl
An instruction-tuned multimodal large language model that generates text based on user-provided prompts and images
![](https://replicate.delivery/pbxt/oO5rHoHwsrYGJh5HeElqpBBmjoi1gkXxGofpiQuxMvDNlduRA/result.png)
adirik/owlvit-base-patch32
Zero-shot / open vocabulary object detection
![](https://tjzk.replicate.delivery/models_models_cover_image/38f48931-c374-42ce-abce-46af276f675e/replicate-prediction-xljpyblbcoee.png)
lucataco/moondream1
(Research only) Moondream1 is a vision language model that performs on par with models twice its size
![](https://replicate.delivery/pbxt/aYXKkkuCWM6WN96DfeKBlVFTle875XmLLmMkJFp7rkNryqfHB/0.png)
adirik/kosmos-g
Kosmos-G: Generating Images in Context with Multimodal Large Language Models
![](https://tjzk.replicate.delivery/models_models_featured_image/5fb8bc45-8f92-4869-aa5e-f160b28f4f95/nougat2.png)
adirik/nougat
Nougat: Neural Optical Understanding for Academic Documents
![](https://replicate.delivery/pbxt/09z14i0H7QZhDtBvCnC1WtH05GpU60ZEliQ3ZNRW4WqEf9fRA/output1.png)
adirik/masactrl-sdxl
Editable image generation with MasaCtrl-SDXL
![](https://tjzk.replicate.delivery/models_models_cover_image/e0ddda07-f247-4def-b479-05fa101978b3/tux.png)
zsxkib/uform-gen
🖼️ Super fast 1.5B Image Captioning/VQA Multimodal LLM (Image-to-Text) 🖋️
![](https://replicate.delivery/pbxt/9V1JDXddn1osJp1nffLISS38MtF3BdVT46Xg5npK2cFHZAoQA/sample.png)
cjwbw/unidiffuser
One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale
![](https://replicate.delivery/pbxt/JZLrEi8rmF5aqLeg0iney67r2dPhCcGudYNedSLuHa0chpqk/image%20(1).png)
cjwbw/idefics
Open-access reproduction of large visual language model Flamingo
![](https://replicate.delivery/pbxt/Yd2zn7zhfM1kcSEPox8xdt9tejoaGc8nRypYBp6yJc49cGZRA/out.png)
cjwbw/unival
Unified Model for Image, Video, Audio and Language Tasks