yorickvp / llava-13b

Visual instruction tuning towards large language and vision models with GPT-4 level capabilities

  • Public
  • 25.7M runs
  • L40S
  • GitHub
  • Paper
  • License

Input

Set the REPLICATE_API_TOKEN environment variable:
export REPLICATE_API_TOKEN=<paste-your-token-here>

Find your API token in your account settings.

Run yorickvp/llava-13b using Replicate’s API. Check out the model's schema for an overview of inputs and outputs.

curl -s -X POST \
  -H "Authorization: Bearer $REPLICATE_API_TOKEN" \
  -H "Content-Type: application/json" \
  -H "Prefer: wait" \
  -d $'{
    "version": "80537f9eead1a5bfa72d5ac6ea6414379be41d4d4f6679fd776e9535d1eb58bb",
    "input": {
      "image": "https://replicate.delivery/pbxt/KRULC43USWlEx4ZNkXltJqvYaHpEx2uJ4IyUQPRPwYb8SzPf/view.jpg",
      "top_p": 1,
      "prompt": "Are you allowed to swim here?",
      "max_tokens": 1024,
      "temperature": 0.2
    }
  }' \
  https://api.replicate.com/v1/predictions

To learn more, take a look at Replicate’s HTTP API reference docs.
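
If you prefer the Python client to raw HTTP, you can run the same prediction with the replicate package. The sketch below mirrors the curl example above; it assumes the model's output arrives as an iterable of text chunks, so they are joined before printing.

import replicate

# Same version and inputs as the curl example above.
# REPLICATE_API_TOKEN must be set in the environment.
output = replicate.run(
    "yorickvp/llava-13b:80537f9eead1a5bfa72d5ac6ea6414379be41d4d4f6679fd776e9535d1eb58bb",
    input={
        "image": "https://replicate.delivery/pbxt/KRULC43USWlEx4ZNkXltJqvYaHpEx2uJ4IyUQPRPwYb8SzPf/view.jpg",
        "top_p": 1,
        "prompt": "Are you allowed to swim here?",
        "max_tokens": 1024,
        "temperature": 0.2,
    },
)

# The output streams as text chunks; join them into a single string.
print("".join(output))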

Output

Yes, you are allowed to swim in the lake near the pier. The image shows a pier extending out into the water, and the water appears to be calm, making it a suitable spot for swimming. However, it is always important to be cautious and aware of any potential hazards or regulations in the area before swimming.

This output was created using a different version of the model, yorickvp/llava-13b:a0fdc44e.

Run time and cost

This model costs approximately $0.00098 to run on Replicate, or 1020 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.

This model runs on Nvidia L40S GPU hardware. Predictions typically complete within 1 second.

Readme

Check out the different LLaVA models on Replicate:

Name                     | Version | Base          | Size | Finetunable
v1.5 - Vicuna-13B        | v1.5    | Vicuna        | 13B  | Yes
v1.6 - Vicuna-13B        | v1.6    | Vicuna        | 13B  | No
v1.6 - Vicuna-7B         | v1.6    | Vicuna        | 7B   | No
v1.6 - Mistral-7B        | v1.6    | Mistral       | 7B   | No
v1.6 - Nous-Hermes-2-34B | v1.6    | Nous-Hermes-2 | 34B  | No

🌋 LLaVA v1.5: Large Language and Vision Assistant

Visual instruction tuning towards large language and vision models with GPT-4 level capabilities.

[Project Page] [Demo] [Data] [Model Zoo]

Improved Baselines with Visual Instruction Tuning [Paper]
Haotian Liu, Chunyuan Li, Yuheng Li, Yong Jae Lee

Visual Instruction Tuning (NeurIPS 2023, Oral) [Paper]
Haotian Liu*, Chunyuan Li*, Qingyang Wu, Yong Jae Lee (*Equal Contribution)

Summary

LLaVA is a large multimodal model, trained end-to-end, that combines a vision encoder with Vicuna for general-purpose visual and language understanding. It achieves impressive chat capabilities in the spirit of the multimodal GPT-4 and sets a new state-of-the-art accuracy on Science QA.

Fine-tuning

You can finetune LLaVA on your own dataset using LoRA. Training data is passed to cog train via the train_data parameter. Your training dataset should be a zip file with the following structure:

  • ./images/: A folder with training data images.
  • ./data.json: A JSON file that links images to conversations. For details, see the dataset format instructions in the GitHub repository.

Example contents of data.zip:

./images
./images/76969503-464d-47bc-bc38-08236d746c36.webp
./images/2b0c1c1a-b833-4fc5-b24e-2bab721ea403.webp
./images/17446aea-fd1c-4be5-914a-2d9063a79696.webp
./data.json

Example data.json:

[
    {
        "image": "76969503-464d-47bc-bc38-08236d746c36.webp",
        "id": "76969503-464d-47bc-bc38-08236d746c36",
        "conversations": [
            {
                "from": "human",
                "value": "<image>\nWrite a prompt for Stable Diffusion to generate this image."
            },
            {
                "from": "gpt",
                "value": "portrait of beautiful female warrior by irakli nadar with intricate detailed wearing heavy armor dress designed by alexander mcqueen and rocky gathercole, perfect symmetrical pose, haunting, elite, elegant, ruan jia, dark, hyper detailed, concept art, by gustav klimt, "
            }
        ]
    },
    {
        "image": "2b0c1c1a-b833-4fc5-b24e-2bab721ea403.webp",
        "id": "2b0c1c1a-b833-4fc5-b24e-2bab721ea403",
        "conversations": [
            {
                "from": "human",
                "value": "<image>\nWrite a prompt for Stable Diffusion to generate this image."
            },
            {
                "from": "gpt",
                "value": "floating heads in the sky, one line drawing, monochrome, magritte, jack gaughan, hyperrealism "
            }
        ]
    },
    {
        "image": "17446aea-fd1c-4be5-914a-2d9063a79696.webp",
        "id": "17446aea-fd1c-4be5-914a-2d9063a79696",
        "conversations": [
            {
                "from": "human",
                "value": "<image>\nWrite a prompt for Stable Diffusion to generate this image."
            },
            {
                "from": "gpt",
                "value": "commissioned fursona portrait of a humanoid red panda wearing a victorian dress, painting by fernand toussaint, trending on artstation, "
            }
        ]
    }
]
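
To assemble an archive in this shape, a small script using Python's standard library is enough. The sketch below assumes a local layout matching the example above (an images/ folder and a data.json next to it); it is not part of the Replicate API.

import json
import zipfile
from pathlib import Path

# Assumed local layout: ./images/*.webp plus ./data.json, as shown above.
dataset_dir = Path(".")

with zipfile.ZipFile("data.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    # Put the conversation file at the archive root.
    zf.write(dataset_dir / "data.json", arcname="data.json")
    # Add every image under images/ with its relative path preserved.
    for image_path in sorted((dataset_dir / "images").iterdir()):
        zf.write(image_path, arcname=f"images/{image_path.name}")

# Sanity check: every "image" referenced in data.json should exist in images/.
entries = json.loads((dataset_dir / "data.json").read_text())
missing = [e["image"] for e in entries if not (dataset_dir / "images" / e["image"]).exists()]
if missing:
    raise FileNotFoundError(f"Referenced images not found: {missing}")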

Example code for training:

import replicate

# Start a LoRA fine-tuning job. Replace [version_id] with the version you want
# to train from; destination is a model you have already created on Replicate.
training = replicate.trainings.create(
    version="yorickvp/llava-13b:[version_id]",
    input={
        "train_data": "https://my-domain/my-input-images.zip",
    },
    destination="my-name/my-model",
)
print(training)
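
Trainings run asynchronously, so a common follow-up is to poll the job until it reaches a terminal state. A minimal sketch using the same client (the sleep interval is arbitrary):

import time
import replicate

# Poll the training job started above until it finishes.
while training.status not in ("succeeded", "failed", "canceled"):
    time.sleep(30)
    training = replicate.trainings.get(training.id)

print(training.status)
print(training.output)  # on success, references the resulting trained version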

You can find more information about finetuning image models in the Replicate docs. The tutorial on finetuning SDXL with your own images is a good starting point.