These models generate text descriptions and captions from videos. They use large multimodal transformers trained on vast datasets that include both video content and corresponding text, such as captions, titles, and descriptions.
Featured models

google/gemini-2.5-flash
Google’s hybrid “thinking” AI model optimized for speed and cost-efficiency
Updated 1 week, 1 day ago
48.8K runs

shreejalmaharjan-27/tiktok-short-captions
Generate TikTok-style captions powered by Whisper (GPU)
Updated 11 months, 3 weeks ago
200.9K runs


fictions-ai/autocaption
Automatically add captions to a video
Updated 1 year, 10 months ago
62.8K runs
Recommended Models
If you’re after quick turnaround for short clips, lucataco/qwen2-vl-7b-instruct is a strong choice—it’s designed to process short videos efficiently while maintaining descriptive accuracy.
Another practical option for speed is fictions-ai/autocaption, which is optimized for adding captions to videos and performs well for quick runs where ultra-low latency isn’t critical.
If you want good quality without excessive compute, lucataco/qwen2-vl-7b-instruct strikes a great balance. It supports detailed video understanding and performs well for most captioning and summarization tasks.
For more complex videos that require deeper reasoning or multiple scenes, lucataco/apollo-7b offers a richer understanding with slightly higher compute tradeoffs.
For social-style captioning—bold overlays, subtitles, and engaging visuals—fictions-ai/autocaption is purpose-built. It lets you upload a video and receive an output with clean, readable captions.
You can customize font, color, and subtitle placement, making it ideal for short-form content like Reels or TikToks.
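Here's a minimal sketch of that workflow using the Replicate Python client. The video input name and all of the styling options below are assumptions for illustration; check the model's API tab on Replicate for the actual input schema.

```python
import replicate

# Sketch: add styled captions to a clip with fictions-ai/autocaption.
# NOTE: the input names below are assumptions; verify them on the model's API tab.
output = replicate.run(
    "fictions-ai/autocaption",  # you may need to pin a version: "owner/name:<hash>"
    input={
        "video_file_input": open("clip.mp4", "rb"),  # local file handle or a public URL
        "font": "Poppins-ExtraBold",                 # assumed styling options
        "color": "white",
        "fontsize": 7,
        "subs_position": "bottom75",
    },
)

# Recent versions of the replicate client return file-like outputs with .read();
# older versions return a URL string instead. If the model returns several files
# (for example video plus transcript), iterate over the list.
with open("captioned.mp4", "wb") as f:
    f.write(output.read())
```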
If your goal is to generate textual descriptions of what’s happening in a video (instead of just overlaying captions), lucataco/qwen2-vl-7b-instruct supports video input and produces detailed visual reasoning outputs.
This makes it useful for accessibility captions, summaries, or content indexing.
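As a rough example, a description request with the Replicate Python client might look like the sketch below; the media and prompt field names are assumptions, so confirm them against the model's API tab.

```python
import replicate

# Sketch: ask qwen2-vl-7b-instruct to describe a video.
# NOTE: "media" and "prompt" are assumed input names; verify them on the model page.
description = replicate.run(
    "lucataco/qwen2-vl-7b-instruct",  # pin a version hash if required
    input={
        "media": "https://example.com/clip.mp4",  # video URL or an uploaded file
        "prompt": "Describe what happens in this video in two or three sentences.",
    },
)
print(description)
```

The returned text can then be stored alongside the video for search, accessibility, or summarization.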
There are two main types of models here:
Overlay caption models typically output a video with subtitles and sometimes an optional transcript file.
Vision-language models usually output text responses—scene descriptions, summaries, or even conversational answers about the video content.
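As a rough illustration of that difference, a small helper like the one below can normalize either kind of result when calling models through the Replicate Python client; the exact return type varies by model and client version, so treat this as a sketch rather than a guaranteed contract.

```python
def handle_output(output, video_path="result.mp4"):
    """Save a file-style result to disk, or return a text-style result as a string."""
    if hasattr(output, "read"):            # overlay models: file-like video output
        with open(video_path, "wb") as f:
            f.write(output.read())
        return video_path
    if isinstance(output, (list, tuple)):  # some models return or stream a list of chunks
        return "".join(str(chunk) for chunk in output)
    return str(output)                     # vision-language models: plain text (or a URL)
```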
Many captioning and vision-language models are open source and can be self-hosted using Cog or Docker.
To publish your own model, define its environment in a cog.yaml file and its inputs and outputs in a predict.py, then push it to Replicate with cog push; it runs automatically on managed GPUs.
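Concretely, a Cog model pairs a cog.yaml (environment, GPU settings, Python packages) with a predict.py that declares typed inputs and outputs. The sketch below shows only the predictor side, and load_video_captioner is a hypothetical stand-in for your own model-loading code.

```python
# predict.py -- minimal Cog predictor sketch for a video captioning model.
from cog import BasePredictor, Input, Path

class Predictor(BasePredictor):
    def setup(self):
        # Runs once when the container starts: load weights here.
        # load_video_captioner is a hypothetical helper standing in for your own code.
        self.model = load_video_captioner("weights/")

    def predict(
        self,
        video: Path = Input(description="Video file to describe"),
        prompt: str = Input(default="Describe this video."),
    ) -> str:
        # Run inference and return a plain-text caption.
        return self.model.caption(str(video), prompt)
```

You can test it locally with cog predict -i video=@clip.mp4 before pushing.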
Most models in this collection allow commercial use, but always check the License section on the model's page for specific terms.
If you’re adding captions to copyrighted content, ensure you have the right to modify and distribute that media.
To try a model in the browser, go to its page on Replicate, upload your video, and click Run.
Models like fictions-ai/autocaption return a captioned video, while lucataco/qwen2-vl-7b-instruct and lucataco/apollo-7b generate text outputs that you can format or display however you like.
If you need a transcript file (.srt or .vtt), confirm that the model supports transcript output.

Recommended Models


lucataco/minicpm-v-4
MiniCPM-V 4.0 has strong image and video understanding performance
Updated 3 months ago
262 runs


lucataco/qwen2.5-omni-7b
Qwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner.
Updated 7 months, 1 week ago
23K runs


lucataco/videollama3-7b
VideoLLaMA 3: Frontier Multimodal Foundation Models for Video Understanding
Updated 9 months ago
21.1K runs


lucataco/qwen2-vl-7b-instruct
Latest model in the Qwen family for chatting with videos and images
Updated 10 months, 3 weeks ago
268K runs


lucataco/apollo-7b
Apollo 7B - An Exploration of Video Understanding in Large Multimodal Models
Updated 10 months, 4 weeks ago
122.1K runs


lucataco/apollo-3b
Apollo 3B - An Exploration of Video Understanding in Large Multimodal Models
Updated 10 months, 4 weeks ago
143 runs


lucataco/bulk-video-caption
Video Preprocessing tool for captioning multiple videos using GPT, Claude or Gemini
Updated 11 months ago
176 runs


chenxwh/cogvlm2-video
CogVLM2: Visual Language Models for Image and Video Understanding
Updated 1 year, 1 month ago
669.4K runs


cuuupid/qwen2-vl-2b
SOTA open-source model for chatting with videos and the newest model in the Qwen family
Updated 1 year, 2 months ago
603 runs


lucataco/qwen-vl-chat
A multimodal LLM-based AI assistant trained with alignment techniques. Qwen-VL-Chat supports more flexible interaction, such as multi-round question answering and creative capabilities.
Updated 2 years, 1 month ago
825.6K runs