yorickvp / llava-13b

Visual instruction tuning towards large language and vision models with GPT-4 level capabilities

If you haven’t yet trained a model on Replicate, we recommend you read one of the following guides.


Trainings for this model run on Nvidia A100 (80GB) GPU hardware, which costs $0.0014 per second.
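At that rate, estimating the cost of a run is simple arithmetic. A quick sketch, assuming billing is purely per-second of GPU time:

```python
# Estimated cost of a training run on Nvidia A100 (80GB) hardware
# at $0.0014 per second.
PRICE_PER_SECOND = 0.0014

def training_cost(seconds: float) -> float:
    """Return the estimated cost in dollars for a run of the given length."""
    return seconds * PRICE_PER_SECOND

# A one-hour run costs about $5.04.
print(f"${training_cost(3600):.2f}")
```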

Create a training

Install the Python library:

pip install replicate

Then, run this to create a training with yorickvp/llava-13b:b5f6212d as the base model:

import replicate

training = replicate.trainings.create(
    version="yorickvp/llava-13b:b5f6212d032508382d61ff00469ddda3e32fd8a0e75dc39d8a4191bb742157fb",
    input={
        "train_data": "https://my-domain/my-input-images.zip",
    },
    destination="{username}/<destination-model-name>"
)

Alternatively, create the training with cURL:

curl -s -X POST \
  -d '{"destination": "{username}/<destination-model-name>", "input": {...}}' \
  -H "Authorization: Bearer $REPLICATE_API_TOKEN" \
  -H "Content-Type: application/json" \
  https://api.replicate.com/v1/models/yorickvp/llava-13b/versions/b5f6212d032508382d61ff00469ddda3e32fd8a0e75dc39d8a4191bb742157fb/trainings

The API response will look like this:

{
  "id": "zz4ibbonubfz7carwiefibzgga",
  "version": "b5f6212d032508382d61ff00469ddda3e32fd8a0e75dc39d8a4191bb742157fb",
  "status": "starting",
  "input": {
    "data": "..."
  },
  "output": null,
  "error": null,
  "logs": null,
  "started_at": null,
  "created_at": "2023-03-28T21:47:58.566434Z",
  "completed_at": null
}
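Once you have a response like the one above, you can inspect it programmatically. A minimal sketch that parses the sample payload and checks the training status (field names are taken from the response above; the payload here is abbreviated):

```python
import json

# Abbreviated sample of the training-creation response shown above.
response_body = '''
{
  "id": "zz4ibbonubfz7carwiefibzgga",
  "status": "starting",
  "output": null,
  "error": null,
  "completed_at": null
}
'''

training = json.loads(response_body)

# A training is still in progress until completed_at is set.
if training["completed_at"] is None:
    print(f"training {training['id']} is {training['status']}")
```

In practice you would fetch the latest version of this payload from the API rather than parsing a stored copy, since status, output, and completed_at change as the training progresses.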

Note that before you can create a training, you’ll need to create a model and use its name as the value for the destination field.

You can fine-tune LLaVA on your own dataset using LoRA! Training data is passed to cog train via the train_data parameter. Your training dataset should be a zip file with the following structure:

  • ./images/: A folder with training data images.
  • ./data.json: A JSON file that links images to conversations. For details, see the dataset format instructions in the GitHub repository.
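To package a dataset in that layout, you can build the zip programmatically. A minimal sketch using the standard library; the data.json entries below are placeholders, so consult the GitHub repository for the actual conversation format:

```python
import json
import zipfile

def build_training_zip(zip_path, image_files, conversations):
    """Write an images/ folder and a data.json file into a training zip.

    image_files: mapping of archive filename -> image bytes
    conversations: the object written as data.json (its schema is defined
    in the LLaVA repository; it is treated as opaque here)
    """
    with zipfile.ZipFile(zip_path, "w") as zf:
        for name, data in image_files.items():
            zf.writestr(f"images/{name}", data)
        zf.writestr("data.json", json.dumps(conversations, indent=2))

# Example: one fake image and a placeholder conversation entry.
build_training_zip(
    "train_data.zip",
    {"cat.jpg": b"<jpeg bytes>"},
    [{"image": "cat.jpg", "conversations": []}],  # placeholder structure
)
```

Upload the resulting zip somewhere publicly reachable (or to a signed URL) and pass that URL as train_data.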

Example code for training:

import replicate

training = replicate.trainings.create(
    version="yorickvp/llava-13b:b5f6212d032508382d61ff00469ddda3e32fd8a0e75dc39d8a4191bb742157fb",
    input={
        "train_data": "https://my-domain/my-input-images.zip",
    },
    destination="{username}/<destination-model-name>"
)

You can find more information about fine-tuning image models in the Replicate docs. The tutorial on fine-tuning SDXL with your own images is a good starting point.