Video Text Generator

Add TikTok-style captions and custom text overlays to videos using AI transcription.

Overview

This model transcribes videos with OpenAI Whisper large-v3 and renders animated karaoke-style captions that highlight each word as it is spoken. It also supports custom text overlays for titles, calls to action, and other timed text elements.

Features

AI Transcription: OpenAI Whisper large-v3 for accurate speech-to-text in 100+ languages
Word-by-Word Highlighting: Karaoke-style captions that highlight each word as it’s spoken
Multiple Caption Styles: “classic” (stroke text) or “boxed” (background boxes like TikTok/Instagram)
Custom Text Overlays: Add titles, CTAs, and dynamic text with precise timing and positioning
Customizable Design: Configure colors, sizes, and positions
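
The word-by-word highlighting can be pictured as a lookup from playback time to the word being spoken. The sketch below is illustrative only and is not the model's actual rendering code: the `Word` tuple, its millisecond fields, and the `active_word_index` helper are hypothetical names, and it assumes the transcription yields words sorted by start time with non-overlapping intervals.

```python
from bisect import bisect_right
from typing import NamedTuple, Optional


class Word(NamedTuple):
    """Hypothetical word-level timestamp record (not the model's real schema)."""
    text: str
    start_ms: int
    end_ms: int


def active_word_index(words: list[Word], t_ms: int) -> Optional[int]:
    """Return the index of the word spoken at t_ms, or None during a pause.

    Assumes `words` is sorted by start_ms and intervals do not overlap.
    Each word is treated as active on the half-open interval [start_ms, end_ms).
    """
    starts = [w.start_ms for w in words]
    # Last word whose start is <= t_ms, if any.
    i = bisect_right(starts, t_ms) - 1
    if i >= 0 and t_ms < words[i].end_ms:
        return i
    return None


words = [Word("Hello", 0, 400), Word("world", 450, 900)]
current = active_word_index(words, 500)  # index of the word to highlight
```

A renderer could call such a function once per frame and apply the highlight color to the returned word, leaving the rest of the caption in its base color.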

Input Parameters

Parameter	Type	Default	Description
`video`	file	required	Video file (MP4, MOV, MKV, WebM)
`caption_style`	string	`boxed`	Caption style: “classic” or “boxed”
`caption_size`	int	`60`	Font size for captions
`highlight_color`	string	`#39E508`	Highlight color (hex format)
`caption_position`	int	`150`	Distance from bottom in pixels
`text_overlays`	string	`""`	JSON array of text overlays
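
Because `text_overlays` is typed as a string, the overlay array has to be serialized to JSON before it is sent. The following is a minimal sketch of assembling an input payload with the defaults from the table above. The `build_input` helper is a hypothetical name, the six-digit hex check is an assumption about what "hex format" means, and how `video` is passed (path, URL, or file handle) depends on the client you use.

```python
import json
import re

ALLOWED_STYLES = {"classic", "boxed"}
# Assumption: "hex format" means a six-digit #RRGGBB value.
HEX_COLOR = re.compile(r"^#[0-9A-Fa-f]{6}$")


def build_input(video, caption_style="boxed", caption_size=60,
                highlight_color="#39E508", caption_position=150,
                overlays=None):
    """Assemble an input dict using the documented parameter names and defaults."""
    if caption_style not in ALLOWED_STYLES:
        raise ValueError(f"caption_style must be one of {sorted(ALLOWED_STYLES)}")
    if not HEX_COLOR.match(highlight_color):
        raise ValueError("highlight_color must look like #RRGGBB")
    return {
        "video": video,
        "caption_style": caption_style,
        "caption_size": caption_size,
        "highlight_color": highlight_color,
        "caption_position": caption_position,
        # The parameter is a string, so the overlay list is JSON-encoded.
        "text_overlays": json.dumps(overlays) if overlays else "",
    }


payload = build_input(
    "clip.mp4",  # placeholder; pass whatever file reference your client expects
    caption_style="classic",
    overlays=[{"text": "WATCH THIS!", "startMs": 0, "endMs": 2000}],
)
```

Validating on the client side like this is optional, but it surfaces typos in the style name or color before any processing time is spent.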

Text Overlays Format

[
  {
    "text": "WATCH THIS!",
    "startMs": 0,
    "endMs": 2000,
    "position": 400,
    "fontSize": 80,
    "color": "#FFFFFF"
  },
  {
    "text": "Follow for more!",
    "startMs": 5000,
    "endMs": 7000,
    "position": 500,
    "fontSize": 50,
    "color": "#FF0000",
    "backgroundColor": "rgba(0,0,0,0.5)"
  }
]
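
Before submitting an overlay array, it can help to check it locally. The sketch below parses the JSON string, verifies each overlay's timing, and lists the overlays visible at a given moment. Treating `text`, `startMs`, and `endMs` as required, and treating each overlay as visible on the half-open interval from `startMs` up to but not including `endMs`, are assumptions made for illustration; the function names are hypothetical.

```python
import json

# Assumption: these three keys are the minimum an overlay needs.
REQUIRED_KEYS = ("text", "startMs", "endMs")


def parse_overlays(raw: str) -> list[dict]:
    """Parse a text_overlays string and check each entry's keys and timing."""
    if not raw.strip():
        return []  # the default "" means no overlays
    overlays = json.loads(raw)
    if not isinstance(overlays, list):
        raise ValueError("text_overlays must be a JSON array")
    for i, overlay in enumerate(overlays):
        missing = [k for k in REQUIRED_KEYS if k not in overlay]
        if missing:
            raise ValueError(f"overlay {i} is missing {missing}")
        if not 0 <= overlay["startMs"] < overlay["endMs"]:
            raise ValueError(f"overlay {i} needs 0 <= startMs < endMs")
    return overlays


def visible_overlays(overlays: list[dict], t_ms: int) -> list[dict]:
    """Return the overlays whose time window contains t_ms."""
    return [o for o in overlays if o["startMs"] <= t_ms < o["endMs"]]


raw = json.dumps([
    {"text": "WATCH THIS!", "startMs": 0, "endMs": 2000, "position": 400},
    {"text": "Follow for more!", "startMs": 5000, "endMs": 7000, "position": 500},
])
overlays = parse_overlays(raw)
on_screen = [o["text"] for o in visible_overlays(overlays, 1000)]
```

With the two overlays from the example above, only the first is on screen at one second, and nothing is shown between two and five seconds.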

Use Cases

Social Media Content: Add captions to TikTok, Instagram Reels, YouTube Shorts
Accessibility: Make videos accessible to deaf/hard-of-hearing viewers
Marketing: Add titles, CTAs, and promotional text to videos
Education: Caption educational and tutorial content
Localization: Transcribe content for translation workflows

Model Architecture

Transcription: OpenAI Whisper large-v3
Rendering: Remotion (React-based video rendering)
Runtime: Bun + Node.js

Limitations

Transcription quality depends on audio clarity
Very fast speech may produce inaccurate word timings, so highlights can drift from the audio
Background noise can affect accuracy
GPU required for reasonable processing times

Ethical Considerations

Only use on content you have rights to process
Do not use to create misleading or deceptive content
Respect copyright and intellectual property

License

MIT License - see LICENSE for details.

Author

Developed by Helder Lima
