Video Text Remover
Clean videos by automatically removing text overlays.
Model Description
This model automatically detects and removes hardcoded text overlays (subtitles, captions, watermarks) from videos using a combination of YOLO (You Only Look Once) object detection and context-aware inpainting algorithms. It preserves video quality while seamlessly removing text, making it ideal for content repurposing, localization, and accessibility improvements.
Key Features
- AI-Powered Detection: Uses YOLOv8 trained specifically for text overlay detection
- Multiple Removal Methods: 6 different inpainting algorithms optimized for various use cases
- High Quality Output: H.264 encoding with configurable quality settings
- GPU Accelerated: Automatic CUDA/TensorRT support for 3-10x faster processing
- Production Ready: Deployed on Replicate for easy API access
Intended Uses
Primary Use Cases
- Content Localization: Remove original language subtitles to add new translations
- Video Editing: Clean footage for re-editing or remixing without text overlays
- Content Repurposing: Prepare videos for different markets or platforms
- Accessibility: Replace hardcoded subtitles with proper closed captions
- Archival: Create clean master copies of video content
Out-of-Scope Uses
- ❌ Removing copyright notices or watermarks from protected content
- ❌ Removing creator credits or mandatory disclosures
- ❌ Circumventing content protection mechanisms
- ❌ Processing content that violates laws or platform policies
How to Use
Replicate Web Interface
Visit replicate.com/hjunior29/video-text-remover and upload your video.
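For programmatic access, the model can also be called with the official `replicate` Python client. The snippet below is a minimal sketch, not the canonical usage: it assumes `REPLICATE_API_TOKEN` is set in your environment, and depending on the deployment a version-pinned identifier (`hjunior29/video-text-remover:<version>`) may be required.

```python
# pip install replicate
import replicate

# Run the model on a local video file. Parameter names match the
# "Parameters" section below; REPLICATE_API_TOKEN must be exported.
with open("input.mp4", "rb") as video_file:
    output = replicate.run(
        "hjunior29/video-text-remover",  # a version suffix may be required
        input={
            "video": video_file,
            "method": "hybrid",          # see removal methods below
            "conf_threshold": 0.25,
            "iou_threshold": 0.45,
            "margin": 5,
        },
    )

print(output)  # URL or file handle of the cleaned video
```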
Parameters
video (required)
- Type: Video file
- Description: Input video with hardcoded text to remove
- Supported formats: MP4, AVI, MOV, and other common formats
method (optional)
- Type: String
- Default: "hybrid"
- Options: hybrid, inpaint, inpaint_ns, blur, black, background
- Description: Text removal algorithm
  - hybrid: Best quality using context-aware inpainting (recommended)
  - inpaint: Fast TELEA inpainting
  - inpaint_ns: Navier-Stokes inpainting for smooth gradients
  - blur: Gaussian blur over text regions
  - black: Fill with black pixels
  - background: Fill with surrounding color
conf_threshold (optional)
- Type: Float
- Range: 0.0 - 1.0
- Default: 0.25
- Description: Detection confidence threshold. Lower values detect more text but may include false positives.
iou_threshold (optional)
- Type: Float
- Range: 0.0 - 1.0
- Default: 0.45
- Description: Intersection-over-Union threshold for Non-Maximum Suppression
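To make the interplay between `conf_threshold` and `iou_threshold` concrete, here is an illustrative sketch of a standard confidence-filter-plus-NMS step (not this model's actual post-processing code): detections below the confidence threshold are dropped, then boxes overlapping a higher-scoring box by more than the IoU threshold are suppressed.

```python
def iou(a, b):
    """Intersection-over-Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def filter_detections(dets, conf_threshold=0.25, iou_threshold=0.45):
    """dets: list of (box, score) tuples. Returns NMS-filtered detections."""
    dets = [d for d in dets if d[1] >= conf_threshold]         # drop low-confidence boxes
    dets.sort(key=lambda d: d[1], reverse=True)                # highest score first
    kept = []
    for box, score in dets:
        if all(iou(box, k) < iou_threshold for k, _ in kept):  # suppress heavy overlaps
            kept.append((box, score))
    return kept
```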
margin (optional)
- Type: Integer
- Range: 0 - 20 pixels
- Default: 5
- Description: Extra pixels to expand around detected text regions
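The `margin` parameter grows each detected box before the fill step so that anti-aliased text edges and soft outlines are covered. A hypothetical sketch of that expansion, clamped to the frame bounds:

```python
def expand_box(box, margin, frame_w, frame_h):
    """Expand an (x1, y1, x2, y2) box by `margin` pixels, clamped to the frame."""
    x1, y1, x2, y2 = box
    return (
        max(0, x1 - margin),
        max(0, y1 - margin),
        min(frame_w, x2 + margin),
        min(frame_h, y2 + margin),
    )
```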
Model Details
Architecture
- Detection Model: YOLOv8s-based object detector
- Framework: ONNX Runtime with GPU support
- Model Size: 27 MB
- Parameters: ~9M
- Input Resolution: 640x640 (with padding)
- Providers: CUDA, TensorRT, or CPU fallback
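As a rough sketch of how such a detector is typically loaded and fed with ONNX Runtime, assuming a 640x640 letterboxed RGB input (the model filename and exact preprocessing below are assumptions, not taken from this repository):

```python
import cv2
import numpy as np
import onnxruntime as ort

# Prefer TensorRT, then CUDA, then CPU; ONNX Runtime falls back automatically
# if a provider is unavailable in the installed build.
session = ort.InferenceSession(
    "text_detector.onnx",  # hypothetical filename
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"],
)

def letterbox(frame, size=640):
    """Resize with preserved aspect ratio, then pad to size x size."""
    h, w = frame.shape[:2]
    scale = size / max(h, w)
    resized = cv2.resize(frame, (int(w * scale), int(h * scale)))
    canvas = np.full((size, size, 3), 114, dtype=np.uint8)  # gray padding
    canvas[: resized.shape[0], : resized.shape[1]] = resized
    return canvas, scale

frame = cv2.imread("frame.jpg")
inp, scale = letterbox(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
inp = inp.transpose(2, 0, 1)[None].astype(np.float32) / 255.0  # NCHW, normalized
outputs = session.run(None, {session.get_inputs()[0].name: inp})
```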
Removal Methods:
1. Hybrid Inpainting (recommended): Context-aware TELEA inpainting over an expanded region
2. TELEA Inpainting: Fast Marching Method-based inpainting
3. Navier-Stokes Inpainting: Fluid dynamics-based propagation
4. Gaussian Blur: Makes text unreadable while preserving colors
5. Black Fill: Simple black pixel replacement
6. Background Fill: Samples and fills with the surrounding color
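For reference, the simpler methods map directly onto standard OpenCV calls. The snippet below is an illustrative approximation rather than the exact implementation; in particular, the hybrid variant is omitted and the background fill is shown in simplified form (median of pixels sampled just above the box).

```python
import cv2
import numpy as np

def remove_text(frame, box, method="inpaint"):
    """Fill one detected text box in a BGR frame using the chosen strategy."""
    x1, y1, x2, y2 = box
    mask = np.zeros(frame.shape[:2], dtype=np.uint8)
    mask[y1:y2, x1:x2] = 255

    if method == "inpaint":        # TELEA (Fast Marching Method)
        return cv2.inpaint(frame, mask, 3, cv2.INPAINT_TELEA)
    if method == "inpaint_ns":     # Navier-Stokes propagation
        return cv2.inpaint(frame, mask, 3, cv2.INPAINT_NS)
    if method == "blur":           # make text unreadable, keep colors
        out = frame.copy()
        out[y1:y2, x1:x2] = cv2.GaussianBlur(frame[y1:y2, x1:x2], (31, 31), 0)
        return out
    if method == "black":          # plain black fill
        out = frame.copy()
        out[y1:y2, x1:x2] = 0
        return out
    if method == "background":     # simplified surrounding-color fill
        out = frame.copy()
        sample = frame[max(0, y1 - 5):y1, x1:x2]
        if sample.size:
            out[y1:y2, x1:x2] = np.median(sample.reshape(-1, 3), axis=0).astype(np.uint8)
        return out
    raise ValueError(f"unknown method: {method}")
```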
Training Data
The YOLOv8 detection model was custom-trained for text overlay detection:
- Text Types: Hardcoded video subtitles, captions, and on-screen text overlays
- Fonts & Styles: Various fonts, sizes, colors, and styling (bold, outlined, shadowed)
- Languages: Multi-language support (Latin, Cyrillic, Asian characters, etc.)
- Backgrounds: Diverse video content (movies, TV shows, social media, educational content)
- Positions: Bottom-centered (most common), top, and custom-positioned text
- Training Focus: Optimized to detect complete text blocks rather than individual characters
Performance Metrics
- Detection Accuracy: ~95% mAP@0.5 on validation set
- Processing Speed:
- CPU: ~2-5 FPS
- GPU (CUDA): ~15-30 FPS (3-6x faster)
- GPU (TensorRT): ~30-50 FPS (6-10x faster)
- False Positive Rate: <5% with default threshold (0.25)
- Cold Start Time: ~3-5 seconds (model loading)
Limitations and Biases
Technical Limitations
- Text Size: Very small text (<10 pixels) may not be detected reliably
- Transparent Text: Semi-transparent overlays may be difficult to detect
- Complex Backgrounds: Text over highly detailed backgrounds may leave visible artifacts
- Fast Motion: Rapid camera movements can affect detection accuracy
- Stylized Text: Heavily stylized or artistic text may not be detected
Processing Limitations
- Video Length: Very long videos (>1 hour) may require extended processing time
- Resolution: 4K+ videos are downscaled for detection, then upscaled for output
- Frame Rate: High frame rate videos (>60fps) increase processing time proportionally
Ethical Considerations
- Content Rights: Users must have legal rights to modify the video content
- Attribution: Do not use to remove creator credits or mandatory attributions
- Accessibility: Removing hardcoded subtitles may reduce accessibility for some users
- Transparency: Disclose when videos have been modified using this tool
Biases
- Language Bias: Model trained primarily on Latin alphabet text; may have reduced accuracy for other scripts
- Content Bias: Better performance on professional video content (movies, TV shows) than amateur content
- Position Bias: Optimized for bottom-centered subtitles; may have reduced accuracy for text in other positions
Example Results
Input video with hardcoded text:
- Original resolution preserved
- Text detected with bounding boxes
- Inpainting applied to remove text
- Video re-encoded with H.264
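As a rough illustration of this per-frame pipeline (detect, remove, re-encode), assuming hypothetical `detect_text_boxes` and `remove_text` helpers like the ones sketched earlier; the real pipeline re-encodes with H.264, whereas the `avc1` fourcc shown here depends on the local OpenCV build.

```python
import cv2

def clean_video(src_path, dst_path, detect_text_boxes, remove_text):
    """Read src_path frame by frame, remove detected text, write dst_path."""
    cap = cv2.VideoCapture(src_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    writer = cv2.VideoWriter(dst_path, cv2.VideoWriter_fourcc(*"avc1"), fps, (w, h))

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        for box in detect_text_boxes(frame):   # (x1, y1, x2, y2) per detection
            frame = remove_text(frame, box, method="inpaint")
        writer.write(frame)

    cap.release()
    writer.release()
```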
Before: Video with subtitles “Got my bf a mini truck for Christmas…”
After: Clean video with text seamlessly removed using hybrid inpainting
Environmental Impact
- Hardware Type: NVIDIA T4 GPU (Replicate default)
- Compute Region: Multi-region (US, EU)
- Carbon Footprint: Estimated ~0.01 kg CO2 per minute of video processed
Citation
If you use this model in your research or application, please cite:
@misc{video-text-remover-2025,
author = {Helder Lima},
title = {Video Text Remover: AI-Powered Text Overlay Removal},
year = {2025},
publisher = {Replicate},
howpublished = {\url{https://replicate.com/hjunior29/video-text-remover}}
}
Model Card Authors
Helder Lima
Model Card Contact
For questions, issues, or feedback:
- GitHub: hjunior29/video-text-remover
- Replicate: hjunior29/video-text-remover
License
MIT License - See LICENSE file for details
Acknowledgments
- YOLO: Ultralytics for the YOLOv8 architecture
- ONNX Runtime: Microsoft for efficient model inference
- OpenCV: For image processing and inpainting algorithms
- Replicate: For model hosting and deployment infrastructure