Video Text Remover

Clean videos by automatically removing text overlays.

Model Description

This model automatically detects and removes hardcoded text overlays (subtitles, captions, watermarks) from videos using a combination of YOLO (You Only Look Once) object detection and context-aware inpainting algorithms. It preserves video quality while seamlessly removing text, making it ideal for content repurposing, localization, and accessibility improvements.
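
At a high level, processing is a per-frame loop: detect text boxes, build a mask expanded by a small margin, inpaint the masked region, and re-encode. The following is a minimal sketch of that general idea, not the model's actual code; detect_text is a hypothetical stand-in for the YOLO detector, and the H.264 codec tag depends on your OpenCV build:

    import cv2
    import numpy as np

    def detect_text(frame):
        # hypothetical stand-in for the ONNX YOLO detector;
        # would return a list of (x1, y1, x2, y2) pixel boxes
        return []

    def remove_text(in_path, out_path, margin=5):
        cap = cv2.VideoCapture(in_path)
        fps = cap.get(cv2.CAP_PROP_FPS)
        w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
        h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
        # "avc1" requests H.264; availability depends on the OpenCV build
        writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"avc1"), fps, (w, h))
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            mask = np.zeros((h, w), np.uint8)
            for x1, y1, x2, y2 in detect_text(frame):
                # expand each box by `margin` pixels, as the margin parameter does
                mask[max(0, y1 - margin):y2 + margin, max(0, x1 - margin):x2 + margin] = 255
            writer.write(cv2.inpaint(frame, mask, 3, cv2.INPAINT_TELEA))
        cap.release()
        writer.release()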

Key Features

  • AI-Powered Detection: Uses a YOLOv8 model trained specifically for text overlay detection
  • Multiple Removal Methods: 6 different inpainting algorithms optimized for various use cases
  • High Quality Output: H.264 encoding with configurable quality settings
  • GPU Accelerated: Automatic CUDA/TensorRT support for 3-10x faster processing
  • Production Ready: Deployed on Replicate for easy API access

Intended Uses

Primary Use Cases

  • Content Localization: Remove original language subtitles to add new translations
  • Video Editing: Clean footage for re-editing or remixing without text overlays
  • Content Repurposing: Prepare videos for different markets or platforms
  • Accessibility: Replace hardcoded subtitles with proper closed captions
  • Archival: Create clean master copies of video content

Out-of-Scope Uses

  • ❌ Removing copyright notices or watermarks from protected content
  • ❌ Removing creator credits or mandatory disclosures
  • ❌ Circumventing content protection mechanisms
  • ❌ Processing content that violates laws or platform policies

How to Use

Replicate Web Interface

Visit replicate.com/hjunior29/video-text-remover and upload your video.

Parameters

video (required)

  • Type: Video file
  • Description: Input video with hardcoded text to remove
  • Supported formats: MP4, AVI, MOV, and other common formats

method (optional)

  • Type: String
  • Default: "hybrid"
  • Options: hybrid, inpaint, inpaint_ns, blur, black, background
  • Description: Text removal algorithm
    • hybrid: Best quality using context-aware inpainting (recommended)
    • inpaint: Fast TELEA inpainting
    • inpaint_ns: Navier-Stokes inpainting for smooth gradients
    • blur: Gaussian blur over text regions
    • black: Fill with black pixels
    • background: Fill with surrounding color

conf_threshold (optional)

  • Type: Float
  • Range: 0.0 - 1.0
  • Default: 0.25
  • Description: Detection confidence threshold. Lower values detect more text but may include false positives.

iou_threshold (optional)

  • Type: Float
  • Range: 0.0 - 1.0
  • Default: 0.45
  • Description: Intersection-over-Union threshold for Non-Maximum Suppression

margin (optional)

  • Type: Integer
  • Range: 0 - 20 pixels
  • Default: 5
  • Description: Extra pixels to expand around detected text regions
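
Putting the parameters together, here is a minimal sketch of calling the model through the official Replicate Python client (assumes a REPLICATE_API_TOKEN in your environment; the input names mirror the parameters documented above):

    import replicate

    output = replicate.run(
        "hjunior29/video-text-remover",
        input={
            "video": open("input.mp4", "rb"),  # local file; a URL also works
            "method": "hybrid",
            "conf_threshold": 0.25,
            "iou_threshold": 0.45,
            "margin": 5,
        },
    )
    print(output)  # the processed video output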

Model Details

Architecture

Detection Model: YOLOv8s-based object detector

  • Framework: ONNX Runtime with GPU support
  • Model Size: 27 MB
  • Parameters: ~9M
  • Input Resolution: 640x640 (with padding)
  • Providers: CUDA, TensorRT, or CPU fallback
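
The provider fallback can be reproduced with ONNX Runtime directly; a minimal sketch, assuming the exported detector is saved as text_detector.onnx (a hypothetical filename):

    import onnxruntime as ort

    # Keep only providers present in this build, in priority order
    preferred = ["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]
    providers = [p for p in preferred if p in ort.get_available_providers()]

    session = ort.InferenceSession("text_detector.onnx", providers=providers)
    print(session.get_providers())  # reports which providers were actually loaded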

Removal Methods:

  1. Hybrid Inpainting (recommended): Context-aware TELEA with an expanded region
  2. TELEA Inpainting: Based on the Fast Marching Method
  3. Navier-Stokes Inpainting: Fluid-dynamics-based propagation
  4. Gaussian Blur: Makes text unreadable while preserving colors
  5. Black Fill: Simple black pixel replacement
  6. Background Fill: Samples and fills with the surrounding color
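
Most of these methods map directly onto OpenCV primitives. A minimal sketch of the underlying calls on a single frame (the box coordinates are illustrative, not output from the detector):

    import cv2
    import numpy as np

    frame = cv2.imread("frame.png")              # one extracted video frame
    mask = np.zeros(frame.shape[:2], np.uint8)
    mask[400:440, 100:540] = 255                 # an illustrative detected text box

    # Expand the mask, mirroring the `margin` parameter (~5 px on each side)
    mask = cv2.dilate(mask, np.ones((11, 11), np.uint8))

    telea = cv2.inpaint(frame, mask, 3, cv2.INPAINT_TELEA)  # "inpaint" / "hybrid" base
    ns = cv2.inpaint(frame, mask, 3, cv2.INPAINT_NS)        # "inpaint_ns"

    blur = frame.copy()                                      # "blur"
    blur[mask > 0] = cv2.GaussianBlur(frame, (31, 31), 0)[mask > 0]

    black = frame.copy()                                     # "black"
    black[mask > 0] = 0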

Training Data

The YOLOv8 detection model was custom-trained for text overlay detection:

  • Text types: Hardcoded video subtitles, captions, and on-screen text overlays
  • Fonts & styles: Various fonts, sizes, colors, and styling (bold, outlined, shadowed)
  • Languages: Latin, Cyrillic, and Asian scripts, with Latin-alphabet text most heavily represented (see Biases below)
  • Backgrounds: Diverse video content (movies, TV shows, social media, educational content)
  • Positions: Bottom-centered (most common), top, and custom-positioned text
  • Training focus: Optimized to detect complete text blocks rather than individual characters

Performance Metrics

  • Detection Accuracy: ~95% mAP@0.5 on the validation set
  • Processing Speed:
    • CPU: ~2-5 FPS
    • GPU (CUDA): ~15-30 FPS (3-6x faster)
    • GPU (TensorRT): ~30-50 FPS (6-10x faster)
  • False Positive Rate: <5% with the default threshold (0.25)
  • Cold Start Time: ~3-5 seconds (model loading)

Limitations and Biases

Technical Limitations

  • Text Size: Very small text (<10 pixels) may not be detected reliably
  • Transparent Text: Semi-transparent overlays may be difficult to detect
  • Complex Backgrounds: Text over highly detailed backgrounds may leave visible artifacts
  • Fast Motion: Rapid camera movements can affect detection accuracy
  • Stylized Text: Heavily stylized or artistic text may not be detected

Processing Limitations

  • Video Length: Very long videos (>1 hour) require extended processing time; a one-hour 30 fps video is ~108,000 frames, i.e. roughly an hour of processing even at 30 FPS
  • Resolution: 4K+ frames are downscaled to the detector's 640x640 input; detections are scaled back up to the full-resolution output
  • Frame Rate: High frame rate videos (>60fps) increase processing time proportionally

Ethical Considerations

  • Content Rights: Users must have legal rights to modify the video content
  • Attribution: Do not use to remove creator credits or mandatory attributions
  • Accessibility: Removing hardcoded subtitles may reduce accessibility for some users
  • Transparency: Disclose when videos have been modified using this tool

Biases

  • Language Bias: Model trained primarily on Latin alphabet text; may have reduced accuracy for other scripts
  • Content Bias: Better performance on professional video content (movies, TV shows) than amateur content
  • Position Bias: Optimized for bottom-centered subtitles; may have reduced accuracy for text in other positions

Example Results

Input video with hardcoded text:

  • Original resolution preserved
  • Text detected with bounding boxes
  • Inpainting applied to remove text
  • Video re-encoded with H.264

Before: Video with subtitles “Got my bf a mini truck for Christmas…”

After: Clean video with text seamlessly removed using hybrid inpainting

Environmental Impact

  • Hardware Type: NVIDIA T4 GPU (Replicate default)
  • Compute Region: Multi-region (US, EU)
  • Carbon Footprint: Estimated ~0.01 kg CO2 per minute of video processed

Citation

If you use this model in your research or application, please cite:

@misc{video-text-remover-2025,
  author = {Helder Lima},
  title = {Video Text Remover: AI-Powered Text Overlay Removal},
  year = {2025},
  publisher = {Replicate},
  howpublished = {\url{https://replicate.com/hjunior29/video-text-remover}}
}

Model Card Authors

Helder Lima

Model Card Contact

For questions, issues, or feedback:

  • GitHub: hjunior29/video-text-remover
  • Replicate: hjunior29/video-text-remover

License

MIT License - See LICENSE file for details

Acknowledgments

  • YOLO: Ultralytics for the YOLOv8 architecture
  • ONNX Runtime: Microsoft for efficient model inference
  • OpenCV: For image processing and inpainting algorithms
  • Replicate: For model hosting and deployment infrastructure