hjunior29/video-text-generator

Generate animated captions and text overlays for social media videos

Video Text Generator

Add TikTok-style captions and custom text overlays to videos using AI transcription.

Overview

This model transcribes videos with OpenAI Whisper large-v3 and generates animated karaoke-style captions with word-by-word highlighting. It also supports custom text overlays for titles, calls to action, and other dynamic text elements.

Features

  • AI Transcription: OpenAI Whisper large-v3 for accurate speech-to-text in 100+ languages
  • Word-by-Word Highlighting: Karaoke-style captions that highlight each word as it’s spoken
  • Multiple Caption Styles: “classic” (stroke text) or “boxed” (background boxes like TikTok/Instagram)
  • Custom Text Overlays: Add titles, CTAs, and dynamic text with precise timing and positioning
  • Customizable Design: Configure colors, sizes, and positions
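
Word-by-word highlighting comes down to finding which transcribed word covers the current playback time. The sketch below is a minimal illustration of that lookup, not the model's actual implementation. It assumes each word carries `startMs` and `endMs` fields (a hypothetical shape borrowed from the overlay format), and `active_word_index` is an illustrative name.

```python
from bisect import bisect_right


def active_word_index(words, t_ms):
    """Return the index of the word being spoken at t_ms, or None.

    `words` must be sorted by "startMs". The field names are an
    assumption for illustration, not the model's documented schema.
    """
    starts = [w["startMs"] for w in words]
    i = bisect_right(starts, t_ms) - 1  # last word starting at or before t_ms
    if i >= 0 and t_ms < words[i]["endMs"]:
        return i
    return None  # before the first word, or in a gap between words


# Hypothetical word timings for a two-word caption.
words = [
    {"text": "Watch", "startMs": 0, "endMs": 300},
    {"text": "this", "startMs": 320, "endMs": 600},
]
```

A renderer would call this once per frame and draw the returned word in the highlight color, leaving the rest of the caption in its base style.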

Input Parameters

Parameter         Type    Default   Description
video             file    required  Video file (MP4, MOV, MKV, WebM)
caption_style     string  boxed     Caption style: “classic” or “boxed”
caption_size      int     60        Font size for captions
highlight_color   string  #39E508   Highlight color (hex format)
caption_position  int     150       Distance from bottom in pixels
text_overlays     string  ""        JSON array of text overlays
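
As a rough sketch of how these parameters fit together, the helper below assembles an input dictionary with the documented defaults and rejects obviously invalid values before a run is submitted. `build_input` is a hypothetical helper, not part of the model. The six-digit `#RRGGBB` check is an assumption about what "hex format" means here, and the `video` value is passed through untouched since its exact form (URL or file handle) depends on the client used.

```python
import json
import re

# Assumption: "hex format" means six-digit #RRGGBB.
HEX_COLOR = re.compile(r"^#[0-9A-Fa-f]{6}$")


def build_input(video, caption_style="boxed", caption_size=60,
                highlight_color="#39E508", caption_position=150,
                text_overlays=None):
    """Assemble the model input using the documented defaults."""
    if caption_style not in ("classic", "boxed"):
        raise ValueError(f"unknown caption_style: {caption_style!r}")
    if not HEX_COLOR.match(highlight_color):
        raise ValueError(f"expected #RRGGBB, got {highlight_color!r}")
    return {
        "video": video,
        "caption_style": caption_style,
        "caption_size": caption_size,
        "highlight_color": highlight_color,
        "caption_position": caption_position,
        # text_overlays is a string parameter, so the list is serialized.
        "text_overlays": json.dumps(text_overlays) if text_overlays else "",
    }
```

Note that `text_overlays` is typed as a string, so a list of overlay objects has to be serialized to JSON text rather than passed as a native list.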

Text Overlays Format

[
  {
    "text": "WATCH THIS!",
    "startMs": 0,
    "endMs": 2000,
    "position": 400,
    "fontSize": 80,
    "color": "#FFFFFF"
  },
  {
    "text": "Follow for more!",
    "startMs": 5000,
    "endMs": 7000,
    "position": 500,
    "fontSize": 50,
    "color": "#FF0000",
    "backgroundColor": "rgba(0,0,0,0.5)"
  }
]
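
Timing mistakes in an overlay list are easy to make and cheap to catch before rendering. The check below is a minimal sketch under stated assumptions: it treats `text`, `startMs`, and `endMs` as required (the example above suggests `backgroundColor` is optional, but the full schema is not documented here), and `validate_overlays` is a hypothetical name.

```python
import json

# Assumption: these three keys are the minimum an overlay needs.
REQUIRED = ("text", "startMs", "endMs")


def validate_overlays(overlays):
    """Return a list of problems found; an empty list means none."""
    problems = []
    for i, overlay in enumerate(overlays):
        missing = [key for key in REQUIRED if key not in overlay]
        if missing:
            problems.append(f"overlay {i}: missing {', '.join(missing)}")
            continue
        if overlay["startMs"] < 0 or overlay["endMs"] <= overlay["startMs"]:
            problems.append(f"overlay {i}: need 0 <= startMs < endMs")
    return problems


overlays = [
    {"text": "WATCH THIS!", "startMs": 0, "endMs": 2000,
     "position": 400, "fontSize": 80, "color": "#FFFFFF"},
    {"text": "Follow for more!", "startMs": 5000, "endMs": 7000,
     "position": 500, "fontSize": 50, "color": "#FF0000",
     "backgroundColor": "rgba(0,0,0,0.5)"},
]

# Serialize only once the list passes the checks.
if not validate_overlays(overlays):
    text_overlays = json.dumps(overlays)
```

The resulting `text_overlays` string is what the `text_overlays` input parameter expects.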

Use Cases

  • Social Media Content: Add captions to TikTok, Instagram Reels, YouTube Shorts
  • Accessibility: Make videos accessible to deaf/hard-of-hearing viewers
  • Marketing: Add titles, CTAs, and promotional text to videos
  • Education: Caption educational and tutorial content
  • Localization: Transcribe content for translation workflows

Model Architecture

  • Transcription: OpenAI Whisper large-v3
  • Rendering: Remotion (React-based video rendering)
  • Runtime: Bun + Node.js

Limitations

  • Transcription quality depends on audio clarity
  • Word timings can be imprecise for very fast speech, so highlights may fall out of sync
  • Background noise can affect accuracy
  • GPU required for reasonable processing times

Ethical Considerations

  • Only use on content you have rights to process
  • Do not use to create misleading or deceptive content
  • Respect copyright and intellectual property

License

MIT License - see LICENSE for details.

Author

Developed by Helder Lima
