Video Text Generator
Add TikTok-style captions and custom text overlays to videos using AI transcription.
Overview
This model automatically transcribes a video's speech using OpenAI Whisper large-v3 and generates animated karaoke-style captions with word-by-word highlighting. It also supports custom text overlays for titles, calls to action, and other dynamic text elements.
Features
- AI Transcription: OpenAI Whisper large-v3 for accurate speech-to-text in 100+ languages
- Word-by-Word Highlighting: Karaoke-style captions that highlight each word as it’s spoken
- Multiple Caption Styles: “classic” (stroke text) or “boxed” (background boxes like TikTok/Instagram)
- Custom Text Overlays: Add titles, CTAs, and dynamic text with precise timing and positioning
- Customizable Design: Configure colors, sizes, and positions
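
The word-by-word highlighting listed above amounts to picking, for any playback time, the one word whose time span contains that moment. The sketch below illustrates the idea only and is not taken from this model's renderer: the word-timing shape (`text`, `startMs`, `endMs`) and the helper name `active_word_index` are assumptions made for the example.

```python
from bisect import bisect_right
from typing import Optional

# Hypothetical word timings in milliseconds; a real transcript would supply these.
WORDS = [
    {"text": "Add", "startMs": 0, "endMs": 300},
    {"text": "captions", "startMs": 300, "endMs": 900},
    {"text": "fast", "startMs": 1100, "endMs": 1500},
]


def active_word_index(words: list, t_ms: int) -> Optional[int]:
    """Return the index of the word being spoken at t_ms, or None during silence.

    Assumes words are sorted by startMs and do not overlap.
    """
    starts = [w["startMs"] for w in words]
    # Last word whose start is at or before t_ms.
    i = bisect_right(starts, t_ms) - 1
    if i >= 0 and t_ms < words[i]["endMs"]:
        return i
    return None
```

A renderer could call such a helper once per frame (with the frame time converted to milliseconds) and draw the returned word in the highlight color.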
Input Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `video` | file | required | Video file (MP4, MOV, MKV, WebM) |
| `caption_style` | string | `boxed` | Caption style: “classic” or “boxed” |
| `caption_size` | int | `60` | Font size for captions |
| `highlight_color` | string | `#39E508` | Highlight color (hex format) |
| `caption_position` | int | `150` | Distance from bottom in pixels |
| `text_overlays` | string | `""` | JSON array of text overlays |
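
As a rough illustration of how the parameters above fit together, the following sketch assembles an input dictionary from the documented defaults and checks the two constrained fields before submission. The helper name `build_inputs` is hypothetical, and the six-digit hex check is an assumption: the table only says "hex format". How the resulting dictionary is sent to the model depends on the client in use and is not shown here.

```python
import re

# Defaults copied from the Input Parameters table.
DEFAULTS = {
    "caption_style": "boxed",
    "caption_size": 60,
    "highlight_color": "#39E508",
    "caption_position": 150,
    "text_overlays": "",
}

# Assumption: colors are six-digit hex such as "#39E508".
HEX_COLOR = re.compile(r"^#[0-9A-Fa-f]{6}$")


def build_inputs(video_path: str, **overrides) -> dict:
    """Merge caller overrides into the documented defaults and validate them."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    inputs = {**DEFAULTS, **overrides, "video": video_path}
    if inputs["caption_style"] not in ("classic", "boxed"):
        raise ValueError("caption_style must be 'classic' or 'boxed'")
    if not HEX_COLOR.match(inputs["highlight_color"]):
        raise ValueError("highlight_color must look like '#RRGGBB'")
    return inputs
```

For example, `build_inputs("clip.mp4", caption_style="classic", caption_size=72)` keeps the remaining defaults and rejects a misspelled parameter name early.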
Text Overlays Format
[
  {
    "text": "WATCH THIS!",
    "startMs": 0,
    "endMs": 2000,
    "position": 400,
    "fontSize": 80,
    "color": "#FFFFFF"
  },
  {
    "text": "Follow for more!",
    "startMs": 5000,
    "endMs": 7000,
    "position": 500,
    "fontSize": 50,
    "color": "#FF0000",
    "backgroundColor": "rgba(0,0,0,0.5)"
  }
]
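
Because `text_overlays` is typed as a string, the array above has to be serialized to JSON text before it is passed in. A minimal sketch of that step follows; the helper name `overlays_to_param` and its sanity checks (non-empty text, `startMs` before `endMs`) are assumptions for illustration, not rules documented by the model.

```python
import json


def overlays_to_param(overlays: list) -> str:
    """Validate overlay timing and return the JSON string for text_overlays."""
    for overlay in overlays:
        if not overlay.get("text"):
            raise ValueError("each overlay needs non-empty text")
        # Assumed rule: an overlay must start at or after 0 and end after it starts.
        if not 0 <= overlay["startMs"] < overlay["endMs"]:
            raise ValueError(f"bad timing for overlay {overlay['text']!r}")
    return json.dumps(overlays)


overlays = [
    {"text": "WATCH THIS!", "startMs": 0, "endMs": 2000,
     "position": 400, "fontSize": 80, "color": "#FFFFFF"},
    {"text": "Follow for more!", "startMs": 5000, "endMs": 7000,
     "position": 500, "fontSize": 50, "color": "#FF0000",
     "backgroundColor": "rgba(0,0,0,0.5)"},
]
text_overlays = overlays_to_param(overlays)
```

Validating before serializing surfaces timing mistakes locally instead of after a render has been queued.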
Use Cases
- Social Media Content: Add captions to TikTok, Instagram Reels, YouTube Shorts
- Accessibility: Make videos accessible to deaf/hard-of-hearing viewers
- Marketing: Add titles, CTAs, and promotional text to videos
- Education: Caption educational and tutorial content
- Localization: Transcribe content for translation workflows
Model Architecture
- Transcription: OpenAI Whisper large-v3
- Rendering: Remotion (React-based video rendering)
- Runtime: Bun + Node.js
Limitations
- Transcription quality depends on audio clarity
- Word timings may be inaccurate for very fast speech
- Background noise can affect accuracy
- GPU required for reasonable processing times
Ethical Considerations
- Only use on content you have rights to process
- Do not use to create misleading or deceptive content
- Respect copyright and intellectual property
License
MIT License - see LICENSE for details.
Author
Developed by Helder Lima