These models generate text descriptions and captions from videos. They use large multimodal transformers trained on vast datasets that include both video content and corresponding text, such as captions, titles, and descriptions.
Featured models

Google’s hybrid “thinking” AI model optimized for speed and cost-efficiency
Updated 1 week, 4 days ago
162.9K runs
shreejalmaharjan-27/tiktok-short-captions
Generate TikTok-style captions powered by Whisper (GPU)
Updated 1 year ago
203.3K runs

fictions-ai/autocaption
Automatically add captions to a video
Updated 1 year, 11 months ago
66.7K runs
Recommended Models
If you’re after quick turnaround for short clips, lucataco/qwen2-vl-7b-instruct is a strong choice—it’s designed to process short videos efficiently while maintaining descriptive accuracy.
Another practical option for speed is fictions-ai/autocaption, which is optimized for adding captions to videos and performs well for quick runs where ultra-low latency isn’t critical.
If you want good quality without excessive compute, lucataco/qwen2-vl-7b-instruct strikes a great balance. It supports detailed video understanding and performs well for most captioning and summarization tasks.
For more complex videos that require deeper reasoning or multiple scenes, lucataco/apollo-7b offers a richer understanding with slightly higher compute tradeoffs.
For social-style captioning—bold overlays, subtitles, and engaging visuals—fictions-ai/autocaption is purpose-built. It lets you upload a video and receive an output with clean, readable captions.
You can customize font, color, and subtitle placement, making it ideal for short-form content like Reels or TikToks.
If your goal is to generate textual descriptions of what’s happening in a video (instead of just overlaying captions), lucataco/qwen2-vl-7b-instruct supports video input and produces detailed visual reasoning outputs.
This makes it useful for accessibility captions, summaries, or content indexing.
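For example, a minimal sketch of requesting a description through the Replicate Python client could look like this. The input field names (media and prompt) are assumptions, so check the model's API tab for the exact schema.

```python
import replicate

# Ask the vision-language model to describe a short clip.
# NOTE: "media" and "prompt" are assumed input names; verify them
# on the model's API page before running.
output = replicate.run(
    "lucataco/qwen2-vl-7b-instruct",
    input={
        "media": open("clip.mp4", "rb"),  # local video file
        "prompt": "Describe what happens in this video in two sentences.",
    },
)
print(output)  # plain-text description you can store or index
```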
There are two main types of models here:
Overlay caption models typically output a video with subtitles and sometimes an optional transcript file.
Vision-language models usually output text responses—scene descriptions, summaries, or even conversational answers about the video content.
Many captioning and vision-language models are open source and can be self-hosted using Cog or Docker.
To publish your own model, create a cog.yaml file describing its environment, define the inputs and outputs in a Cog predictor, and push it to Replicate with cog push; it'll then run automatically on managed GPUs.
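A minimal Cog predictor sketch might look like the following. The predict body here is a stub and any model-loading helper is hypothetical; the real logic depends on the model you wrap.

```python
from cog import BasePredictor, Input, Path

class Predictor(BasePredictor):
    def setup(self):
        # Load model weights once when the container starts.
        # self.model = load_captioning_model()  # hypothetical helper
        pass

    def predict(
        self,
        video: Path = Input(description="Video to caption"),
        prompt: str = Input(default="Describe this video."),
    ) -> str:
        # Return a text caption for the uploaded video.
        # A real predictor would call the wrapped model here.
        return f"Caption for {video} given prompt: {prompt}"
```

With cog.yaml describing the Python and GPU environment, cog push r8.im/<your-username>/<model-name> builds the image and publishes it to Replicate.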
Most models in this collection allow commercial use, but always check the License section on the model's page for specific terms.
If you’re adding captions to copyrighted content, ensure you have the right to modify and distribute that media.
Go to a model’s page on Replicate, upload your video, and click Run.
Models like fictions-ai/autocaption return a captioned video, while lucataco/qwen2-vl-7b-instruct and lucataco/apollo-7b generate text outputs that you can format or display however you like.
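If you would rather script this than use the web UI, a hedged sketch with the Replicate Python client is shown below; the input field names (video_file_input, font, color) are placeholders, so confirm the real schema on the model's API tab.

```python
import replicate

# Overlay-style captioning: the model returns a new video with
# burned-in subtitles rather than a text response.
# NOTE: the input field names below are placeholders; confirm the
# actual schema on the model's API tab before running.
output = replicate.run(
    "fictions-ai/autocaption",
    input={
        "video_file_input": open("reel.mp4", "rb"),
        "font": "bold",    # placeholder styling option
        "color": "white",  # placeholder styling option
    },
)
print(output)  # URL of the captioned video to download
```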
If you need a subtitle or transcript file (such as .srt or .vtt), confirm that the model supports transcript output.

Recommended models

lucataco/minicpm-v-4
MiniCPM-V 4.0 has strong image and video understanding performance
Updated 3 months, 3 weeks ago
269 runs

lucataco/qwen2.5-omni-7b
Qwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner.
Updated 8 months ago
29.1K runs

lucataco/videollama3-7b
VideoLLaMA 3: Frontier Multimodal Foundation Models for Video Understanding
Updated 9 months, 3 weeks ago
21.4K runs

lucataco/qwen2-vl-7b-instruct
Latest model in the Qwen family for chatting about videos and images
Updated 11 months, 2 weeks ago
304.8K runs

lucataco/apollo-7b
Apollo 7B - An Exploration of Video Understanding in Large Multimodal Models
Updated 11 months, 2 weeks ago
122.5K runs

lucataco/apollo-3b
Apollo 3B - An Exploration of Video Understanding in Large Multimodal Models
Updated 11 months, 2 weeks ago
145 runs

lucataco/bulk-video-caption
Video preprocessing tool for captioning multiple videos using GPT, Claude or Gemini
Updated 11 months, 3 weeks ago
178 runs

chenxwh/cogvlm2-video
CogVLM2: Visual Language Models for Image and Video Understanding
Updated 1 year, 2 months ago
670.7K runs

cuuupid/qwen2-vl-2b
SOTA open-source model for chatting with videos and the newest model in the Qwen family
Updated 1 year, 3 months ago
606 runs

lucataco/qwen-vl-chat
A multimodal LLM-based AI assistant, which is trained with alignment techniques. Qwen-VL-Chat supports more flexible interaction, such as multi-round question answering, and creative capabilities.
Updated 2 years, 1 month ago
825.6K runs