yorickvp / llava-13b

Visual instruction tuning towards large language and vision models with GPT-4 level capabilities

Check out the different LLaVA models on Replicate:

Name                     | Version | Base          | Size | Finetunable
v1.5 - Vicuna-13B        | v1.5    | Vicuna        | 13B  | Yes
v1.6 - Vicuna-13B        | v1.6    | Vicuna        | 13B  | No
v1.6 - Vicuna-7B         | v1.6    | Vicuna        | 7B   | No
v1.6 - Mistral-7B        | v1.6    | Mistral       | 7B   | No
v1.6 - Nous-Hermes-2-34B | v1.6    | Nous-Hermes-2 | 34B  | No

🌋 LLaVA v1.5: Large Language and Vision Assistant

Visual instruction tuning towards large language and vision models with GPT-4 level capabilities.

[Project Page] [Demo] [Data] [Model Zoo]

Improved Baselines with Visual Instruction Tuning [Paper]
Haotian Liu, Chunyuan Li, Yuheng Li, Yong Jae Lee

Visual Instruction Tuning (NeurIPS 2023, Oral) [Paper]
Haotian Liu*, Chunyuan Li*, Qingyang Wu, Yong Jae Lee (*Equal Contribution)

Summary

LLaVA is an end-to-end trained large multimodal model that connects a vision encoder to Vicuna for general-purpose visual and language understanding. It achieves impressive chat capabilities in the spirit of the multimodal GPT-4 and sets a new state-of-the-art accuracy on Science QA.

Fine-tuning

You can fine-tune LLaVA on your own dataset using LoRA. Pass the training data to cog train with the train_data parameter. The training dataset should be a zip file with the following structure:

  • ./images/: A folder containing the training images.
  • ./data.json: A JSON file that links images to conversations. For details, see the dataset format instructions in the GitHub repository.

Example contents of data.zip:

./images
./images/76969503-464d-47bc-bc38-08236d746c36.webp
./images/2b0c1c1a-b833-4fc5-b24e-2bab721ea403.webp
./images/17446aea-fd1c-4be5-914a-2d9063a79696.webp
./data.json
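
Before uploading, it can help to confirm that an archive follows this layout. The following is a minimal sketch using only the Python standard library; check_layout is a hypothetical helper name, not part of cog or the Replicate client, and it checks only the two requirements listed above.

```python
import zipfile


def check_layout(zip_path):
    """Return a list of layout problems found in the archive (empty if none)."""
    problems = []
    with zipfile.ZipFile(zip_path) as zf:
        # Normalise names by dropping a leading "./", then skip directory entries.
        names = [n[2:] if n.startswith("./") else n for n in zf.namelist()]
        files = [n for n in names if n and not n.endswith("/")]
    if "data.json" not in files:
        problems.append("missing data.json at the archive root")
    if not any(n.startswith("images/") for n in files):
        problems.append("no files under images/")
    return problems
```

For example, check_layout("data.zip") returns an empty list when both data.json and at least one file under images/ are present.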

Example data.json:

[
    {
        "image": "76969503-464d-47bc-bc38-08236d746c36.webp",
        "id": "76969503-464d-47bc-bc38-08236d746c36",
        "conversations": [
            {
                "from": "human",
                "value": "<image>\nWrite a prompt for Stable Diffusion to generate this image."
            },
            {
                "from": "gpt",
                "value": "portrait of beautiful female warrior by irakli nadar with intricate detailed wearing heavy armor dress designed by alexander mcqueen and rocky gathercole, perfect symmetrical pose, haunting, elite, elegant, ruan jia, dark, hyper detailed, concept art, by gustav klimt, "
            }
        ]
    },
    {
        "image": "2b0c1c1a-b833-4fc5-b24e-2bab721ea403.webp",
        "id": "2b0c1c1a-b833-4fc5-b24e-2bab721ea403",
        "conversations": [
            {
                "from": "human",
                "value": "<image>\nWrite a prompt for Stable Diffusion to generate this image."
            },
            {
                "from": "gpt",
                "value": "floating heads in the sky, one line drawing, monochrome, magritte, jack gaughan, hyperrealism "
            }
        ]
    },
    {
        "image": "17446aea-fd1c-4be5-914a-2d9063a79696.webp",
        "id": "17446aea-fd1c-4be5-914a-2d9063a79696",
        "conversations": [
            {
                "from": "human",
                "value": "<image>\nWrite a prompt for Stable Diffusion to generate this image."
            },
            {
                "from": "gpt",
                "value": "commissioned fursona portrait of a humanoid red panda wearing a victorian dress, painting by fernand toussaint, trending on artstation, "
            }
        ]
    }
]
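
Entries in this shape can be generated instead of written by hand. The sketch below assumes one question shared by all images and one answer per image; make_record and build_zip are hypothetical helper names, and the choice of the file name without its extension as the id simply mirrors the example above rather than a documented requirement.

```python
import json
import zipfile
from pathlib import Path


def make_record(image_path, question, answer):
    """Build one data.json entry in the shape shown in the example above."""
    image_path = Path(image_path)
    return {
        "image": image_path.name,  # file name as stored under images/
        "id": image_path.stem,     # assumption: file name without extension
        "conversations": [
            {"from": "human", "value": "<image>\n" + question},
            {"from": "gpt", "value": answer},
        ],
    }


def build_zip(pairs, question, out_path="data.zip"):
    """Write images/ and data.json into out_path.

    pairs is an iterable of (image_path, answer) tuples.
    Returns the list of records written to data.json.
    """
    records = []
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for image_path, answer in pairs:
            records.append(make_record(image_path, question, answer))
            zf.write(image_path, arcname=f"images/{Path(image_path).name}")
        zf.writestr("data.json", json.dumps(records, indent=4))
    return records
```

A call such as build_zip([("photos/cat.webp", "a cat on a sofa")], "Describe this image.") would produce a data.zip containing images/cat.webp and a one-entry data.json; the paths and texts here are placeholders.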

Example code for training:

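import replicate

# The client reads the API token from the REPLICATE_API_TOKEN environment variable.
# Replace [version_id] with the version of yorickvp/llava-13b to train from.
training = replicate.trainings.create(
    version="yorickvp/llava-13b:[version_id]",
    input={
        # URL of the zip archive described above.
        "train_data": "https://my-domain/my-input-images.zip",
    },
    # Model that receives the fine-tuned version, as owner/name.
    destination="my-name/my-model",
)
print(training)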
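
Training runs asynchronously, so a script usually needs to wait for the job to finish. The following is a rough sketch of a polling loop; wait_for_training is a hypothetical helper, and the status lookup is passed in as a callable so the loop does not depend on any particular client call. The terminal status names are the ones Replicate reports for trainings.

```python
import time

# Statuses after which a training no longer changes.
TERMINAL_STATUSES = {"succeeded", "failed", "canceled"}


def wait_for_training(get_status, poll_seconds=10.0, max_polls=360, sleep=time.sleep):
    """Call get_status() until it returns a terminal status, then return it.

    Raises TimeoutError if no terminal status is seen within max_polls checks.
    """
    for _ in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)  # wait before asking again
    raise TimeoutError(f"training still running after {max_polls} polls")
```

With the client from the example above, get_status could be lambda: replicate.trainings.get(training.id).status, which fetches the training again on each poll.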

You can find more information about fine-tuning image models in the Replicate docs. The tutorial on fine-tuning SDXL with your own images is a good starting point.