interact-brands / llava-13b-spotter-creator

Fine-tuned LLaVA model for YouTube thumbnail classification

  • Public
  • 22 runs
  • Fine-tune

Run time and cost

This model runs on Nvidia A40 (Large) GPU hardware. We don't yet have enough runs of this model to provide performance information.

Readme

Fine-Tuned LLaVA-13B-Vicuna Model

Overview

This repository contains the fine-tuned LLaVA-13B-Vicuna model, trained specifically to classify YouTube thumbnails along three dimensions: Composition, Color Scheme, and Primary Creator Emotion. The model was trained on a curated dataset of thumbnails from popular creators such as MrBeast, Marques Brownlee, Veritasium, Mark Rober, The Try Guys, and Captain Disillusion.

Data Preparation

Data Sourcing

Thumbnails from the creators listed above were collected and organized into separate folders based on the classification dimensions. This structured organization of the dataset enabled efficient fine-tuning of the model, as sketched below.
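As a rough illustration, the snippet below sketches how folder-organized thumbnails could be flattened into a LLaVA-style data.json. The folder layout, label names, and prompt wording here are hypothetical; the exact annotation format used for this model is not published in this readme.

```python
import json
from pathlib import Path

# Hypothetical layout: thumbnails/<dimension>/<label>/<image>.jpg
ROOT = Path("thumbnails")

entries = []
for dimension_dir in ROOT.iterdir():           # e.g. composition, color_scheme, emotion
    for label_dir in dimension_dir.iterdir():  # e.g. close_up, high_contrast, surprised
        for image_path in label_dir.glob("*.jpg"):
            entries.append({
                "id": image_path.stem,
                "image": str(image_path),
                # LLaVA-style conversation pair: a question about one dimension,
                # with the answer taken from the folder the image was filed under.
                "conversations": [
                    {"from": "human",
                     "value": f"<image>\nClassify the {dimension_dir.name} of this thumbnail."},
                    {"from": "gpt", "value": label_dir.name},
                ],
            })

with open("data.json", "w") as f:
    json.dump(entries, f, indent=2)
```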

Fine-Tuning the Model

Model Selection & Training

Fine-tuning was performed on the LLaVA-13B-Vicuna model, leveraging both visual and textual data so that a single model handles all three classification tasks simultaneously. Training involved the following key steps:

  • Dataset: annotated thumbnails organized in data.json
  • Training duration: 5 epochs, with a steady decrease in loss observed

Model Architecture and Hyperparameters

  • Architecture: LLaVA-13B, combining the Vicuna 13-billion-parameter language decoder with a CLIP vision transformer
  • Batch size: 16
  • Learning rate: 0.0002
  • LoRA adapters: enabled, with r=128 and alpha=256
  • Optimizer: AdamW with a cosine learning rate scheduler
  • Logging: enabled at every training step
  • Hardware: NVIDIA A100 80 GB GPU
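For orientation, here is a minimal sketch of how these hyperparameters map onto a standard PEFT/PyTorch setup. The base checkpoint, LoRA target modules, and step count are placeholders, since the actual training script is not published here.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import LlavaForConditionalGeneration, get_cosine_schedule_with_warmup

# Assumed base checkpoint (LLaVA-13B = Vicuna-13B decoder + CLIP vision tower);
# the exact base weights used for this fine-tune are not stated in this readme.
model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-13b-hf")

# LoRA adapters with the stated parameters: r=128, alpha=256.
lora_config = LoraConfig(
    r=128,
    lora_alpha=256,
    target_modules=["q_proj", "v_proj"],  # assumed target modules, not stated above
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# AdamW at the stated learning rate of 0.0002, with a cosine schedule.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
steps_per_epoch = 100  # placeholder: len(train_dataloader)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps=steps_per_epoch * 5,  # 5 epochs, as stated above
)
```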

Usage

To use the fine-tuned model, you can access it via the provided API endpoints or integrate it directly into your application. Detailed instructions are to be added soon.
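Until those instructions land, a call through the Replicate Python client would look roughly like the sketch below. The version hash and the input keys ("image", "prompt") are assumptions based on typical LLaVA deployments on Replicate; check the model's API tab for the real schema.

```python
import replicate

# Version hash is a placeholder; copy the real one from the model's API tab.
output = replicate.run(
    "interact-brands/llava-13b-spotter-creator:<version-hash>",
    input={
        "image": open("thumbnail.jpg", "rb"),  # assumed input key
        "prompt": "Classify the composition, color scheme, "
                  "and primary creator emotion of this thumbnail.",
    },
)
print("".join(output))  # LLaVA models on Replicate typically stream text chunks
```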