Caption videos

These models generate text descriptions and captions from videos. They use large multimodal transformers trained on vast datasets that include both video content and corresponding text, such as captions, titles, and descriptions.

Key capabilities:

  • Video captioning: Produce captions that summarize a video's content and context. Useful for indexing videos, improving accessibility, and automating alt text.
  • Visual question answering: Generate natural language answers to questions about your videos.

Frequently asked questions

Which models are the fastest?

If you’re after quick turnaround for short clips, lucataco/qwen2-vl-7b-instruct is a strong choice—it’s designed to process short videos efficiently while maintaining descriptive accuracy.

Another practical option for speed is fictions-ai/autocaption, which is optimized for adding captions to videos and performs well for quick runs where ultra-low latency isn’t critical.

Which models offer the best balance of cost and quality?

If you want good quality without excessive compute, lucataco/qwen2-vl-7b-instruct strikes a great balance. It supports detailed video understanding and performs well for most captioning and summarization tasks.

For more complex videos that require deeper reasoning or multiple scenes, lucataco/apollo-7b offers a richer understanding with slightly higher compute tradeoffs.

What works best for adding stylized captions to social videos?

For social-style captioning—bold overlays, subtitles, and engaging visuals—fictions-ai/autocaption is purpose-built. It lets you upload a video and receive an output with clean, readable captions.

You can customize font, color, and subtitle placement, making it ideal for short-form content like Reels or TikToks.
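If you prefer to script this rather than use the web form, the sketch below uses the Replicate Python client. The input field names (video_file_input, font, color, subs_position) and their values are guesses at autocaption's schema, so check the model's API tab on Replicate for the exact parameters it accepts.

    # Sketch: stylized overlay captions via the Replicate Python client.
    # Input names below are assumptions; verify them against the model's schema.
    import replicate

    output = replicate.run(
        "fictions-ai/autocaption",
        input={
            "video_file_input": open("reel.mp4", "rb"),  # assumed name for the video input
            "font": "Poppins-Bold",                      # assumed styling options
            "color": "white",
            "subs_position": "bottom75",
        },
    )
    print(output)  # typically a URL or file handle for the captioned video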

What works best for scene-level description or video understanding?

If your goal is to generate textual descriptions of what’s happening in a video (instead of just overlaying captions), lucataco/qwen2-vl-7b-instruct supports video input and produces detailed visual reasoning outputs.
This makes it useful for accessibility captions, summaries, or content indexing.
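As a rough sketch of what that looks like programmatically, the call below uses the Replicate Python client; the "media" and "prompt" input names are assumptions, so confirm them against the model's schema before relying on them.

    # Sketch: scene-level description with a vision-language model.
    import replicate

    description = replicate.run(
        "lucataco/qwen2-vl-7b-instruct",
        input={
            "media": open("clip.mp4", "rb"),  # assumed input name for the video
            "prompt": "Describe what happens in this video, scene by scene.",
        },
    )
    print(description)  # plain text you can store as alt text or an index entry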

What’s the difference between key subtypes or approaches in this collection?

There are two main types of models here:

  • Overlay caption models (e.g., autocaption): These take a video file and add subtitles directly to the output, ideal for ready-to-publish content.
  • Vision-language models (e.g., qwen2-vl-7b-instruct): These interpret the visuals and generate descriptive text about what’s happening in the video. They offer more flexibility but may require post-processing.

What kinds of outputs can I expect from these models?

Overlay caption models typically output a video with burned-in subtitles, and sometimes a transcript file as well.

Vision-language models usually output text responses—scene descriptions, summaries, or even conversational answers about the video content.
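In practice that means your code needs to handle two shapes of result. The helpers below assume the Replicate Python client, where an overlay model usually returns a URL (or file-like object) for the rendered video and a vision-language model returns text, sometimes as a list of streamed chunks.

    # Handling the two common output shapes.
    import urllib.request

    def save_video(output_url: str, path: str = "captioned.mp4") -> None:
        # Download a rendered, subtitled video to disk.
        urllib.request.urlretrieve(output_url, path)

    def join_text(output) -> str:
        # Some models stream text as a list of chunks; others return one string.
        return "".join(output) if isinstance(output, (list, tuple)) else str(output)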

How can I self-host or push a model to Replicate?

Many captioning and vision-language models are open source and can be self-hosted using Cog or Docker.

To publish your own model, define a cog.yaml file (build environment and dependencies) and a predict.py file (inputs and outputs), push it to Replicate with cog push, and it'll run on managed GPUs.
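Here is a minimal predict.py sketch using Cog's Python API. Only the BasePredictor, Input, and Path pieces come from Cog; load_captioning_model and the caption call are placeholders for whatever model you wrap.

    # predict.py: minimal Cog predictor sketch for a video captioning model.
    from cog import BasePredictor, Input, Path


    class Predictor(BasePredictor):
        def setup(self) -> None:
            # Load weights once, when the container starts.
            self.model = load_captioning_model()  # placeholder for your model loader

        def predict(
            self,
            video: Path = Input(description="Video to caption"),
            prompt: str = Input(default="Describe this video."),
        ) -> str:
            # Run the wrapped model and return a text caption.
            return self.model.caption(str(video), prompt)  # placeholder call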

Can I use these models for commercial work?

Yes—most models in this collection allow commercial use, but always check the License section on the model’s page for specific terms.

If you’re adding captions to copyrighted content, ensure you have the right to modify and distribute that media.

How do I use or run these models?

Go to a model’s page on Replicate, upload your video, and click Run.
Models like fictions-ai/autocaption return a captioned video, while lucataco/qwen2-vl-7b-instruct and lucataco/apollo-7b generate text outputs that you can format or display however you like.

What should I know before running a job in this collection?

  • Longer or high-resolution videos require more compute, so trim clips when possible (see the trimming sketch after this list).
  • If you need timestamped captions (like .srt or .vtt), confirm that the model supports transcript output.
  • Vision-language models currently focus on visual reasoning—some don’t interpret audio yet, so spoken dialogue might not be included in results.
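If you have ffmpeg installed, trimming a clip before upload is quick; the stream-copy flag avoids re-encoding. This is a general-purpose sketch, not tied to any particular model.

    # Keep only the first 30 seconds of a clip without re-encoding.
    import subprocess

    subprocess.run(
        [
            "ffmpeg", "-y",
            "-i", "input.mp4",
            "-ss", "0",        # start time in seconds
            "-t", "30",        # duration to keep
            "-c", "copy",      # copy streams instead of re-encoding
            "trimmed.mp4",
        ],
        check=True,
    )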

Any other collection-specific tips or considerations?

  • For batch processing, use models like lucataco/bulk-video-caption to handle multiple videos efficiently (a simple batching loop is sketched after this list).
  • For social media workflows, choose a model that supports subtitle styling and automatic line breaks.
  • For accessibility or archival tasks, consider combining both types: overlay captions for the video and descriptive text from a vision-language model.
  • Always review generated captions or descriptions—models can miss nuance, subtle action, or audio-only context.
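For the batch case, a plain loop over the Replicate Python client is often enough. The model name and input fields below are assumptions; substitute whichever captioning model and schema you actually use.

    # Sketch: caption every clip in a folder with one vision-language model.
    import pathlib
    import replicate

    for clip in sorted(pathlib.Path("clips").glob("*.mp4")):
        result = replicate.run(
            "lucataco/qwen2-vl-7b-instruct",   # assumed model choice
            input={
                "media": open(clip, "rb"),     # assumed input name
                "prompt": "Write a one-sentence caption for this video.",
            },
        )
        text = "".join(result) if isinstance(result, (list, tuple)) else str(result)
        print(f"{clip.name}: {text}")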