hjunior29/video-text-generator

Generate animated captions and text overlays for social media videos

Run time and cost

This model runs on Nvidia H100 GPU hardware.

Video Text Generator

Add TikTok-style captions and custom text overlays to videos using AI transcription.

Overview

This model transcribes a video's speech with OpenAI Whisper large-v3 and renders animated karaoke-style captions that highlight each word as it is spoken. It also supports custom text overlays for titles, calls to action, and other timed text elements.

Features

  • AI Transcription: OpenAI Whisper large-v3 for accurate speech-to-text in roughly 100 languages
  • Word-by-Word Highlighting: Karaoke-style captions that highlight each word as it’s spoken
  • Multiple Caption Styles: “classic” (stroke text) or “boxed” (background boxes like TikTok/Instagram)
  • Custom Text Overlays: Add titles, CTAs, and dynamic text with precise timing and positioning
  • Customizable Design: Configure colors, sizes, and positions
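The word-by-word highlighting described above boils down to finding which word is active at a given playback time. The following is a minimal sketch of that lookup, not the model's actual implementation: the `Word` structure, its field names, and the assumption that words are sorted and non-overlapping are all hypothetical.

```python
from bisect import bisect_right
from typing import NamedTuple, Optional, Sequence


class Word(NamedTuple):
    """Hypothetical word-level timestamp, as a transcription step might emit."""
    text: str
    start_ms: int  # when the word begins, in milliseconds
    end_ms: int    # when the word ends (exclusive), in milliseconds


def active_word_index(words: Sequence[Word], t_ms: int) -> Optional[int]:
    """Return the index of the word spoken at t_ms, or None during a gap.

    Assumes words are sorted by start_ms and do not overlap.
    """
    starts = [w.start_ms for w in words]
    i = bisect_right(starts, t_ms) - 1  # last word starting at or before t_ms
    if i >= 0 and t_ms < words[i].end_ms:
        return i
    return None


words = [Word("Add", 0, 300), Word("captions", 320, 900), Word("fast", 950, 1200)]
current = active_word_index(words, 500)
if current is not None:
    print(words[current].text)  # the word a renderer would highlight at 500 ms
```

A renderer would call a lookup like this once per frame and draw the returned word in the highlight color, leaving the others in the base color.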

Input Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| video | file | required | Video file (MP4, MOV, MKV, WebM) |
| caption_style | string | boxed | Caption style: “classic” or “boxed” |
| caption_size | int | 60 | Font size for captions |
| highlight_color | string | #39E508 | Highlight color (hex format) |
| caption_position | int | 150 | Distance from bottom in pixels |
| text_overlays | string | "" | JSON array of text overlays |
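As an illustration of how these parameters fit together, here is a minimal sketch that merges caller overrides with the documented defaults and rejects obviously invalid values before a request is sent. The `build_input` helper is hypothetical, and the specific checks (six-digit hex colors, positive font size, non-negative position) are assumptions, not rules confirmed by the model. The required `video` file is left out and would need to be supplied separately.

```python
import re

# Defaults copied from the parameter list above.
DEFAULTS = {
    "caption_style": "boxed",
    "caption_size": 60,
    "highlight_color": "#39E508",
    "caption_position": 150,
    "text_overlays": "",
}

# Assumption: "hex format" means a six-digit #RRGGBB value.
_HEX_COLOR = re.compile(r"^#[0-9A-Fa-f]{6}$")


def build_input(**overrides):
    """Return an input dict with defaults applied and basic validation done."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")

    params = {**DEFAULTS, **overrides}

    if params["caption_style"] not in ("classic", "boxed"):
        raise ValueError("caption_style must be 'classic' or 'boxed'")
    if not _HEX_COLOR.match(params["highlight_color"]):
        raise ValueError("highlight_color must look like #RRGGBB")
    if params["caption_size"] <= 0:
        raise ValueError("caption_size must be positive")
    if params["caption_position"] < 0:
        raise ValueError("caption_position must not be negative")
    return params


params = build_input(caption_style="classic", highlight_color="#FFD700")
print(params)
```

Validating locally is a design choice: a typo in a color or style name fails immediately instead of after a video upload and a GPU run.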

Text Overlays Format

[
  {
    "text": "WATCH THIS!",
    "startMs": 0,
    "endMs": 2000,
    "position": 400,
    "fontSize": 80,
    "color": "#FFFFFF"
  },
  {
    "text": "Follow for more!",
    "startMs": 5000,
    "endMs": 7000,
    "position": 500,
    "fontSize": 50,
    "color": "#FF0000",
    "backgroundColor": "rgba(0,0,0,0.5)"
  }
]
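Because `text_overlays` is a string parameter, the array above has to be passed as serialized JSON. The sketch below builds overlay objects with the field names from the example and encodes them with the standard library. The `make_overlay` and `encode_overlays` helpers are hypothetical, and two points are assumptions: that `endMs` must be later than `startMs`, and that `backgroundColor` is optional (only the second example includes it).

```python
import json


def make_overlay(text, start_ms, end_ms, position, font_size, color,
                 background_color=None):
    """Build one overlay dict using the field names from the example above."""
    if end_ms <= start_ms:
        # Assumption: an overlay must be visible for a positive duration.
        raise ValueError("endMs must be greater than startMs")
    overlay = {
        "text": text,
        "startMs": start_ms,
        "endMs": end_ms,
        "position": position,
        "fontSize": font_size,
        "color": color,
    }
    if background_color is not None:
        overlay["backgroundColor"] = background_color
    return overlay


def encode_overlays(overlays):
    """Serialize a list of overlay dicts into the string form of text_overlays."""
    return json.dumps(overlays)


text_overlays = encode_overlays([
    make_overlay("WATCH THIS!", 0, 2000, 400, 80, "#FFFFFF"),
    make_overlay("Follow for more!", 5000, 7000, 500, 50, "#FF0000",
                 background_color="rgba(0,0,0,0.5)"),
])
print(text_overlays)
```

The resulting string can then be supplied as the `text_overlays` input value.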

Use Cases

  • Social Media Content: Add captions to TikTok videos, Instagram Reels, and YouTube Shorts
  • Accessibility: Make videos accessible to deaf/hard-of-hearing viewers
  • Marketing: Add titles, CTAs, and promotional text to videos
  • Education: Caption educational and tutorial content
  • Localization: Transcribe content for translation workflows

Model Architecture

  • Transcription: OpenAI Whisper large-v3
  • Rendering: Remotion (React-based video rendering)
  • Runtime: Bun + Node.js

Limitations

  • Transcription quality depends on audio clarity
  • Very fast speech may produce imprecise word timings, so highlights can drift from the audio
  • Background noise can affect accuracy
  • GPU required for reasonable processing times

Ethical Considerations

  • Only use on content you have rights to process
  • Do not use to create misleading or deceptive content
  • Respect copyright and intellectual property

License

MIT License - see LICENSE for details.

Author

Developed by Helder Lima