n1jl0091 / video-llava-7b-hf_replicate_n1jl0091

Upload an image or video, and Video-LLaVA will give you a text description of what it "sees."

Information from HuggingFace: https://huggingface.co/LanguageBind/Video-LLaVA-7B-hf

Model Details:

Model type: Video-LLaVA is an open-source multimodal model trained by fine-tuning an LLM on multimodal instruction-following data. It is an auto-regressive language model based on the transformer architecture. Base LLM: lmsys/vicuna-13b-v1.5

Model Description: The model accepts interleaved images and videos as input and generates text, even though its training data contains no image-video pairs. Video-LLaVA uses an encoder trained for unified visual representation, with image and video features aligned before projection into the LLM's input space. The authors' experiments indicate that the two modalities complement each other, with the model outperforming models designed specifically for either images or videos.
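
Example (sketch): The snippet below illustrates how a prompt for image or video input might be prepared for the Hugging Face checkpoint linked above. It is a minimal sketch, not the code this Replicate deployment runs. The helper names (build_prompt, sample_frame_indices, describe_video) are hypothetical, the "USER: <video>\n... ASSISTANT:" prompt layout and the choice of 8 uniformly sampled frames follow the usage examples in the Hugging Face documentation, and describe_video assumes the transformers classes VideoLlavaProcessor and VideoLlavaForConditionalGeneration are installed.

```python
def build_prompt(question, media="video"):
    """Build a single-turn prompt with an <image> or <video> placeholder.

    Assumption: the layout follows the Hugging Face usage examples.
    """
    if media not in ("image", "video"):
        raise ValueError("media must be 'image' or 'video'")
    return f"USER: <{media}>\n{question} ASSISTANT:"


def sample_frame_indices(total_frames, num_frames=8):
    """Pick frame indices spread uniformly across a clip.

    Assumption: 8 frames per clip, as in the Hugging Face examples.
    Clips shorter than num_frames return every available frame.
    """
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    n = min(num_frames, total_frames)
    return [int(i * total_frames / n) for i in range(n)]


def describe_video(frames, question, max_new_tokens=80):
    """Run the model on a clip (hypothetical helper, downloads the weights).

    `frames` is expected to be an array of shape (num_frames, height, width, 3).
    Imports are local so the helpers above work without transformers installed.
    """
    from transformers import VideoLlavaForConditionalGeneration, VideoLlavaProcessor

    model_id = "LanguageBind/Video-LLaVA-7B-hf"
    processor = VideoLlavaProcessor.from_pretrained(model_id)
    model = VideoLlavaForConditionalGeneration.from_pretrained(model_id)

    inputs = processor(text=build_prompt(question), videos=frames, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.batch_decode(output_ids, skip_special_tokens=True)[0]
```

For instance, build_prompt("What is happening in this clip?") yields a video prompt, and sample_frame_indices(80) selects every tenth frame of an 80-frame clip. Decoding the selected frames from a video file is left out here because it depends on the video library in use.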