Caption videos

These models generate text descriptions and captions from videos. They use large multimodal transformers trained on vast datasets that include both video content and corresponding text, such as captions, titles, and descriptions.

Key capabilities:

  • Video captioning: Produce captions that summarize a video's content and context. Useful for indexing videos, improving accessibility, and automating alt text for videos.
  • Visual question answering: Generate natural language answers to questions about a video's content.
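
Both capabilities can be sketched with the `replicate` Python client. This is a minimal sketch, not the documented interface of any model listed here: the input names `video` and `prompt` are assumptions (check the chosen model's input schema), the helper names `build_input` and `describe_video` are hypothetical, and a community model may need a pinned version in the form `owner/name:version`.

```python
from typing import Any, Callable, Dict, Optional

# Model reference taken from the list on this page. Leaving the version
# unpinned is an assumption; some models require "owner/name:version".
MODEL_REF = "lucataco/qwen2.5-omni-7b"


def build_input(video_url: str, question: Optional[str] = None) -> Dict[str, Any]:
    """Build the input payload for a captioning or question-answering call."""
    # With no question, fall back to a generic captioning prompt.
    prompt = question or "Describe this video in one or two sentences."
    # "video" and "prompt" are ASSUMED input names, not confirmed by this page.
    return {"video": video_url, "prompt": prompt}


def describe_video(
    video_url: str,
    question: Optional[str] = None,
    run: Optional[Callable[..., Any]] = None,
) -> str:
    """Return a caption, or an answer if a question is given."""
    if run is None:
        # Imported lazily so the helper can be exercised without the client
        # installed. The real call needs REPLICATE_API_TOKEN in the environment.
        import replicate

        run = replicate.run
    output = run(MODEL_REF, input=build_input(video_url, question))
    # Some models return a single string, others an iterable of text chunks.
    if isinstance(output, str):
        return output.strip()
    return "".join(str(chunk) for chunk in output).strip()
```

Calling `describe_video(url)` requests a caption, while `describe_video(url, "What happens at the end?")` asks a question about the same video. The `run` parameter exists only so the call can be swapped for a stub during testing.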

Recommended models

lucataco / minicpm-v-4

MiniCPM-V 4.0 has strong image and video understanding performance

Updated 1 month, 3 weeks ago

209 runs

lucataco / qwen2.5-omni-7b

Qwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner.

Updated 6 months ago

13.8K runs

lucataco / videollama3-7b

VideoLLaMA 3: Frontier Multimodal Foundation Models for Video Understanding

Updated 7 months, 3 weeks ago

8K runs

lucataco / apollo-7b

Apollo 7B - An Exploration of Video Understanding in Large Multimodal Models

Updated 9 months, 2 weeks ago

107K runs

lucataco / apollo-3b

Apollo 3B - An Exploration of Video Understanding in Large Multimodal Models

Updated 9 months, 2 weeks ago

142 runs

lucataco / bulk-video-caption

Video Preprocessing tool for captioning multiple videos using GPT, Claude or Gemini

Updated 9 months, 3 weeks ago

136 runs

chenxwh / cogvlm2-video

CogVLM2: Visual Language Models for Image and Video Understanding

Updated 1 year ago

662.3K runs

cuuupid / qwen2-vl-2b

SOTA open-source model for chatting with videos and the newest model in the Qwen family

Updated 1 year, 1 month ago

594 runs

lucataco / qwen-vl-chat

A multimodal LLM-based AI assistant, which is trained with alignment techniques. Qwen-VL-Chat supports more flexible interaction, such as multi-round question answering, and creative capabilities.

Updated 1 year, 11 months ago

825.5K runs