Video Text Remover

Clean videos by automatically removing text overlays.

Model Description

This model automatically detects and removes hardcoded text overlays (subtitles, captions, watermarks) from videos using a combination of YOLO (You Only Look Once) object detection and context-aware inpainting algorithms. It preserves video quality while seamlessly removing text, making it ideal for content repurposing, localization, and accessibility improvements.
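
At a high level, processing is a per-frame loop: detect text boxes, build a mask expanded by a small margin, inpaint the masked region, and re-encode. The following is a minimal sketch of that general idea, not the model's actual code; detect_text is a hypothetical stand-in for the YOLO detector, and the H.264 codec tag depends on your OpenCV build:

    import cv2
    import numpy as np

    def detect_text(frame):
        # hypothetical stand-in for the ONNX YOLO detector;
        # would return a list of (x1, y1, x2, y2) pixel boxes
        return []

    def remove_text(in_path, out_path, margin=5):
        cap = cv2.VideoCapture(in_path)
        fps = cap.get(cv2.CAP_PROP_FPS)
        w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
        h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
        # "avc1" requests H.264; availability depends on the OpenCV build
        writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"avc1"), fps, (w, h))
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            mask = np.zeros((h, w), np.uint8)
            for x1, y1, x2, y2 in detect_text(frame):
                # expand each box by `margin` pixels, as the margin parameter does
                mask[max(0, y1 - margin):y2 + margin, max(0, x1 - margin):x2 + margin] = 255
            writer.write(cv2.inpaint(frame, mask, 3, cv2.INPAINT_TELEA))
        cap.release()
        writer.release()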

Key Features

  • AI-Powered Detection: Uses a YOLOv8 model trained specifically for text overlay detection
  • Multiple Removal Methods: 6 different inpainting algorithms optimized for various use cases
  • High Quality Output: H.264 encoding with configurable quality settings
  • GPU Accelerated: Automatic CUDA/TensorRT support for 3-10x faster processing
  • Production Ready: Deployed on Replicate for easy API access

Intended Uses

Primary Use Cases

  • Content Localization: Remove original language subtitles to add new translations
  • Video Editing: Clean footage for re-editing or remixing without text overlays
  • Content Repurposing: Prepare videos for different markets or platforms
  • Accessibility: Replace hardcoded subtitles with proper closed captions
  • Archival: Create clean master copies of video content

Out-of-Scope Uses

  • ❌ Removing copyright notices or watermarks from protected content
  • ❌ Removing creator credits or mandatory disclosures
  • ❌ Circumventing content protection mechanisms
  • ❌ Processing content that violates laws or platform policies

How to Use

Replicate Web Interface

Visit replicate.com/hjunior29/video-text-remover and upload your video.

Parameters

video (required)

  • Type: Video file
  • Description: Input video with hardcoded text to remove
  • Supported formats: MP4, AVI, MOV, and other common formats

method (optional)

  • Type: String
  • Default: "hybrid"
  • Options: hybrid, inpaint, inpaint_ns, blur, black, background
  • Description: Text removal algorithm
    • hybrid: Best quality using context-aware inpainting (recommended)
    • inpaint: Fast TELEA inpainting
    • inpaint_ns: Navier-Stokes inpainting for smooth gradients
    • blur: Gaussian blur over text regions
    • black: Fill with black pixels
    • background: Fill with surrounding color

conf_threshold (optional)

  • Type: Float
  • Range: 0.0 - 1.0
  • Default: 0.25
  • Description: Detection confidence threshold. Lower values detect more text but may include false positives.

iou_threshold (optional)

  • Type: Float
  • Range: 0.0 - 1.0
  • Default: 0.45
  • Description: Intersection-over-Union threshold for Non-Maximum Suppression

margin (optional)

  • Type: Integer
  • Range: 0 - 20 pixels
  • Default: 5
  • Description: Extra pixels to expand around detected text regions
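
Putting the parameters together, here is a minimal sketch of calling the model through the official Replicate Python client (assumes a REPLICATE_API_TOKEN in your environment; the input names mirror the parameters documented above):

    import replicate

    output = replicate.run(
        "hjunior29/video-text-remover",
        input={
            "video": open("input.mp4", "rb"),  # local file; a URL also works
            "method": "hybrid",
            "conf_threshold": 0.25,
            "iou_threshold": 0.45,
            "margin": 5,
        },
    )
    print(output)  # the processed video output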

Model Details

Architecture

Detection Model: YOLOv8s-based object detector

  • Framework: ONNX Runtime with GPU support
  • Model Size: 27 MB
  • Parameters: ~9M
  • Input Resolution: 640x640 (with padding)
  • Providers: CUDA, TensorRT, or CPU fallback
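
The provider fallback can be reproduced with ONNX Runtime directly; a minimal sketch, assuming the exported detector is saved as text_detector.onnx (a hypothetical filename):

    import onnxruntime as ort

    # Keep only providers present in this build, in priority order
    preferred = ["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]
    providers = [p for p in preferred if p in ort.get_available_providers()]

    session = ort.InferenceSession("text_detector.onnx", providers=providers)
    print(session.get_providers())  # reports which providers were actually loaded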

Removal Methods:

  1. Hybrid Inpainting (recommended): Context-aware TELEA with an expanded region
  2. TELEA Inpainting: Based on the Fast Marching Method
  3. Navier-Stokes Inpainting: Fluid-dynamics-based propagation
  4. Gaussian Blur: Makes text unreadable while preserving colors
  5. Black Fill: Simple black pixel replacement
  6. Background Fill: Samples and fills with the surrounding color
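
Most of these methods map directly onto OpenCV primitives. A minimal sketch of the underlying calls on a single frame (the box coordinates are illustrative, not output from the detector):

    import cv2
    import numpy as np

    frame = cv2.imread("frame.png")              # one extracted video frame
    mask = np.zeros(frame.shape[:2], np.uint8)
    mask[400:440, 100:540] = 255                 # an illustrative detected text box

    # Expand the mask, mirroring the `margin` parameter (~5 px on each side)
    mask = cv2.dilate(mask, np.ones((11, 11), np.uint8))

    telea = cv2.inpaint(frame, mask, 3, cv2.INPAINT_TELEA)  # "inpaint" / "hybrid" base
    ns = cv2.inpaint(frame, mask, 3, cv2.INPAINT_NS)        # "inpaint_ns"

    blur = frame.copy()                                      # "blur"
    blur[mask > 0] = cv2.GaussianBlur(frame, (31, 31), 0)[mask > 0]

    black = frame.copy()                                     # "black"
    black[mask > 0] = 0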

Training Data

The YOLOv8 detection model was custom-trained for text overlay detection:

  • Text types: Hardcoded video subtitles, captions, and on-screen text overlays
  • Fonts & styles: Various fonts, sizes, colors, and styling (bold, outlined, shadowed)
  • Languages: Latin, Cyrillic, and Asian scripts, with Latin-alphabet text most heavily represented (see Biases below)
  • Backgrounds: Diverse video content (movies, TV shows, social media, educational content)
  • Positions: Bottom-centered (most common), top, and custom-positioned text
  • Training focus: Optimized to detect complete text blocks rather than individual characters

Performance Metrics

  • Detection Accuracy: ~95% mAP@0.5 on the validation set
  • Processing Speed:
    • CPU: ~2-5 FPS
    • GPU (CUDA): ~15-30 FPS (3-6x faster)
    • GPU (TensorRT): ~30-50 FPS (6-10x faster)
  • False Positive Rate: <5% with the default threshold (0.25)
  • Cold Start Time: ~3-5 seconds (model loading)

Limitations and Biases

Technical Limitations

  • Text Size: Very small text (<10 pixels) may not be detected reliably
  • Transparent Text: Semi-transparent overlays may be difficult to detect
  • Complex Backgrounds: Text over highly detailed backgrounds may leave visible artifacts
  • Fast Motion: Rapid camera movements can affect detection accuracy
  • Stylized Text: Heavily stylized or artistic text may not be detected

Processing Limitations

  • Video Length: Very long videos (>1 hour) require extended processing time; a one-hour 30 fps video is ~108,000 frames, i.e. roughly an hour of processing even at 30 FPS
  • Resolution: 4K+ frames are downscaled to the detector's 640x640 input; detections are scaled back up to the full-resolution output
  • Frame Rate: High frame rate videos (>60fps) increase processing time proportionally

Ethical Considerations

  • Content Rights: Users must have legal rights to modify the video content
  • Attribution: Do not use to remove creator credits or mandatory attributions
  • Accessibility: Removing hardcoded subtitles may reduce accessibility for some users
  • Transparency: Disclose when videos have been modified using this tool

Biases

  • Language Bias: Model trained primarily on Latin alphabet text; may have reduced accuracy for other scripts
  • Content Bias: Better performance on professional video content (movies, TV shows) than amateur content
  • Position Bias: Optimized for bottom-centered subtitles; may have reduced accuracy for text in other positions

Example Results

Input video with hardcoded text:

  • Original resolution preserved
  • Text detected with bounding boxes
  • Inpainting applied to remove text
  • Video re-encoded with H.264

Before: Video with subtitles “Got my bf a mini truck for Christmas…”

After: Clean video with text seamlessly removed using hybrid inpainting

Environmental Impact

  • Hardware Type: NVIDIA T4 GPU (Replicate default)
  • Compute Region: Multi-region (US, EU)
  • Carbon Footprint: Estimated ~0.01 kg CO2 per minute of video processed

Citation

If you use this model in your research or application, please cite:

@misc{video-text-remover-2025,
  author = {Helder Lima},
  title = {Video Text Remover: AI-Powered Text Overlay Removal},
  year = {2025},
  publisher = {Replicate},
  howpublished = {\url{https://replicate.com/hjunior29/video-text-remover}}
}

Model Card Authors

Helder Lima

Model Card Contact

For questions, issues, or feedback:

  • GitHub: hjunior29/video-text-remover
  • Replicate: hjunior29/video-text-remover

License

MIT License - See LICENSE file for details

Acknowledgments

  • YOLO: Ultralytics for the YOLOv8 architecture
  • ONNX Runtime: Microsoft for efficient model inference
  • OpenCV: For image processing and inpainting algorithms
  • Replicate: For model hosting and deployment infrastructure