Video Text Remover
Clean videos by automatically removing text overlays.
Model Description
This model automatically detects and removes hardcoded text overlays (subtitles, captions, watermarks) from videos using a combination of YOLO (You Only Look Once) object detection and context-aware inpainting algorithms. It preserves video quality while seamlessly removing text, making it ideal for content repurposing, localization, and accessibility improvements.
Key Features
- AI-Powered Detection: Uses YOLOv8 trained specifically for text overlay detection
- Multiple Removal Methods: 6 different inpainting algorithms optimized for various use cases
- High Quality Output: H.264 encoding with configurable quality settings
- GPU Accelerated: Automatic CUDA/TensorRT support for 3-10x faster processing
- Production Ready: Deployed on Replicate for easy API access
Intended Uses
Primary Use Cases
- Content Localization: Remove original language subtitles to add new translations
- Video Editing: Clean footage for re-editing or remixing without text overlays
- Content Repurposing: Prepare videos for different markets or platforms
- Accessibility: Replace hardcoded subtitles with proper closed captions
- Archival: Create clean master copies of video content
Out-of-Scope Uses
- ❌ Removing copyright notices or watermarks from protected content
- ❌ Removing creator credits or mandatory disclosures
- ❌ Circumventing content protection mechanisms
- ❌ Processing content that violates laws or platform policies
How to Use
Replicate Web Interface
Visit replicate.com/hjunior29/video-text-remover and upload your video.
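For programmatic access, the model can also be called with the official `replicate` Python client. The snippet below is a minimal sketch, not the canonical usage: it assumes `REPLICATE_API_TOKEN` is set in your environment, and depending on the deployment a version-pinned identifier (`hjunior29/video-text-remover:<version>`) may be required.

```python
# pip install replicate
import replicate

# Run the model on a local video file. Parameter names match the
# "Parameters" section below; REPLICATE_API_TOKEN must be exported.
with open("input.mp4", "rb") as video_file:
    output = replicate.run(
        "hjunior29/video-text-remover",  # a version suffix may be required
        input={
            "video": video_file,
            "method": "hybrid",          # see removal methods below
            "conf_threshold": 0.25,
            "iou_threshold": 0.45,
            "margin": 5,
        },
    )

print(output)  # URL or file handle of the cleaned video
```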
Parameters
video (required)
- Type: Video file
- Description: Input video with hardcoded text to remove
- Supported formats: MP4, AVI, MOV, and other common formats
method (optional)
- Type: String
- Default: "hybrid"
- Options: hybrid, inpaint, inpaint_ns, blur, black, background
- Description: Text removal algorithm
  - hybrid: Best quality using context-aware inpainting (recommended)
  - inpaint: Fast TELEA inpainting
  - inpaint_ns: Navier-Stokes inpainting for smooth gradients
  - blur: Gaussian blur over text regions
  - black: Fill with black pixels
  - background: Fill with surrounding color
conf_threshold (optional)
- Type: Float
- Range: 0.0 - 1.0
- Default: 0.25
- Description: Detection confidence threshold. Lower values detect more text but may include false positives.
iou_threshold (optional)
- Type: Float
- Range: 0.0 - 1.0
- Default: 0.45
- Description: Intersection-over-Union threshold for Non-Maximum Suppression
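To make the interplay between `conf_threshold` and `iou_threshold` concrete, here is an illustrative sketch of a standard confidence-filter-plus-NMS step (not this model's actual post-processing code): detections below the confidence threshold are dropped, then boxes overlapping a higher-scoring box by more than the IoU threshold are suppressed.

```python
def iou(a, b):
    """Intersection-over-Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def filter_detections(dets, conf_threshold=0.25, iou_threshold=0.45):
    """dets: list of (box, score) tuples. Returns NMS-filtered detections."""
    dets = [d for d in dets if d[1] >= conf_threshold]         # drop low-confidence boxes
    dets.sort(key=lambda d: d[1], reverse=True)                # highest score first
    kept = []
    for box, score in dets:
        if all(iou(box, k) < iou_threshold for k, _ in kept):  # suppress heavy overlaps
            kept.append((box, score))
    return kept
```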
margin (optional)
- Type: Integer
- Range: 0 - 20 pixels
- Default: 5
- Description: Extra pixels to expand around detected text regions
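The `margin` parameter grows each detected box before the fill step so that anti-aliased text edges and soft outlines are covered. A hypothetical sketch of that expansion, clamped to the frame bounds:

```python
def expand_box(box, margin, frame_w, frame_h):
    """Expand an (x1, y1, x2, y2) box by `margin` pixels, clamped to the frame."""
    x1, y1, x2, y2 = box
    return (
        max(0, x1 - margin),
        max(0, y1 - margin),
        min(frame_w, x2 + margin),
        min(frame_h, y2 + margin),
    )
```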
Model Details
Architecture
- Detection Model: YOLOv8s-based object detector
- Framework: ONNX Runtime with GPU support
- Model Size: 27 MB
- Parameters: ~9M
- Input Resolution: 640x640 (with padding)
- Providers: CUDA, TensorRT, or CPU fallback
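As a rough sketch of how such a detector is typically loaded and fed with ONNX Runtime, assuming a 640x640 letterboxed RGB input (the model filename and exact preprocessing below are assumptions, not taken from this repository):

```python
import cv2
import numpy as np
import onnxruntime as ort

# Prefer TensorRT, then CUDA, then CPU; ONNX Runtime falls back automatically
# if a provider is unavailable in the installed build.
session = ort.InferenceSession(
    "text_detector.onnx",  # hypothetical filename
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"],
)

def letterbox(frame, size=640):
    """Resize with preserved aspect ratio, then pad to size x size."""
    h, w = frame.shape[:2]
    scale = size / max(h, w)
    resized = cv2.resize(frame, (int(w * scale), int(h * scale)))
    canvas = np.full((size, size, 3), 114, dtype=np.uint8)  # gray padding
    canvas[: resized.shape[0], : resized.shape[1]] = resized
    return canvas, scale

frame = cv2.imread("frame.jpg")
inp, scale = letterbox(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
inp = inp.transpose(2, 0, 1)[None].astype(np.float32) / 255.0  # NCHW, normalized
outputs = session.run(None, {session.get_inputs()[0].name: inp})
```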
Removal Methods:
1. Hybrid Inpainting (recommended): Context-aware TELEA inpainting over an expanded region
2. TELEA Inpainting: Fast Marching Method-based inpainting
3. Navier-Stokes Inpainting: Fluid dynamics-based propagation
4. Gaussian Blur: Makes text unreadable while preserving colors
5. Black Fill: Simple black pixel replacement
6. Background Fill: Samples and fills with the surrounding color
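For reference, the simpler methods map directly onto standard OpenCV calls. The snippet below is an illustrative approximation rather than the exact implementation; in particular, the hybrid variant is omitted and the background fill is shown in simplified form (median of pixels sampled just above the box).

```python
import cv2
import numpy as np

def remove_text(frame, box, method="inpaint"):
    """Fill one detected text box in a BGR frame using the chosen strategy."""
    x1, y1, x2, y2 = box
    mask = np.zeros(frame.shape[:2], dtype=np.uint8)
    mask[y1:y2, x1:x2] = 255

    if method == "inpaint":        # TELEA (Fast Marching Method)
        return cv2.inpaint(frame, mask, 3, cv2.INPAINT_TELEA)
    if method == "inpaint_ns":     # Navier-Stokes propagation
        return cv2.inpaint(frame, mask, 3, cv2.INPAINT_NS)
    if method == "blur":           # make text unreadable, keep colors
        out = frame.copy()
        out[y1:y2, x1:x2] = cv2.GaussianBlur(frame[y1:y2, x1:x2], (31, 31), 0)
        return out
    if method == "black":          # plain black fill
        out = frame.copy()
        out[y1:y2, x1:x2] = 0
        return out
    if method == "background":     # simplified surrounding-color fill
        out = frame.copy()
        sample = frame[max(0, y1 - 5):y1, x1:x2]
        if sample.size:
            out[y1:y2, x1:x2] = np.median(sample.reshape(-1, 3), axis=0).astype(np.uint8)
        return out
    raise ValueError(f"unknown method: {method}")
```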
Training Data
The YOLOv8 detection model was custom-trained for text overlay detection:
- Text Types: Hardcoded video subtitles, captions, and on-screen text overlays
- Fonts & Styles: Various fonts, sizes, colors, and styling (bold, outlined, shadowed)
- Languages: Multi-language support (Latin, Cyrillic, Asian characters, etc.)
- Backgrounds: Diverse video content (movies, TV shows, social media, educational content)
- Positions: Bottom-centered (most common), top, and custom-positioned text
- Training Focus: Optimized to detect complete text blocks rather than individual characters
Performance Metrics
- Detection Accuracy: ~95% mAP@0.5 on validation set
- Processing Speed:
- CPU: ~2-5 FPS
- GPU (CUDA): ~15-30 FPS (3-6x faster)
- GPU (TensorRT): ~30-50 FPS (6-10x faster)
- False Positive Rate: <5% with default threshold (0.25)
- Cold Start Time: ~3-5 seconds (model loading)
Limitations and Biases
Technical Limitations
- Text Size: Very small text (<10 pixels) may not be detected reliably
- Transparent Text: Semi-transparent overlays may be difficult to detect
- Complex Backgrounds: Text over highly detailed backgrounds may leave visible artifacts
- Fast Motion: Rapid camera movements can affect detection accuracy
- Stylized Text: Heavily stylized or artistic text may not be detected
Processing Limitations
- Video Length: Very long videos (>1 hour) may require extended processing time
- Resolution: 4K+ videos are downscaled for detection, then upscaled for output
- Frame Rate: High frame rate videos (>60fps) increase processing time proportionally
Ethical Considerations
- Content Rights: Users must have legal rights to modify the video content
- Attribution: Do not use to remove creator credits or mandatory attributions
- Accessibility: Removing hardcoded subtitles may reduce accessibility for some users
- Transparency: Disclose when videos have been modified using this tool
Biases
- Language Bias: Model trained primarily on Latin alphabet text; may have reduced accuracy for other scripts
- Content Bias: Better performance on professional video content (movies, TV shows) than amateur content
- Position Bias: Optimized for bottom-centered subtitles; may have reduced accuracy for text in other positions
Example Results
Input video with hardcoded text:
- Original resolution preserved
- Text detected with bounding boxes
- Inpainting applied to remove text
- Video re-encoded with H.264
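As a rough illustration of this per-frame pipeline (detect, remove, re-encode), assuming hypothetical `detect_text_boxes` and `remove_text` helpers like the ones sketched earlier; the real pipeline re-encodes with H.264, whereas the `avc1` fourcc shown here depends on the local OpenCV build.

```python
import cv2

def clean_video(src_path, dst_path, detect_text_boxes, remove_text):
    """Read src_path frame by frame, remove detected text, write dst_path."""
    cap = cv2.VideoCapture(src_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    writer = cv2.VideoWriter(dst_path, cv2.VideoWriter_fourcc(*"avc1"), fps, (w, h))

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        for box in detect_text_boxes(frame):   # (x1, y1, x2, y2) per detection
            frame = remove_text(frame, box, method="inpaint")
        writer.write(frame)

    cap.release()
    writer.release()
```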
Before: Video with subtitles “Got my bf a mini truck for Christmas…”
After: Clean video with text seamlessly removed using hybrid inpainting
Environmental Impact
- Hardware Type: NVIDIA T4 GPU (Replicate default)
- Compute Region: Multi-region (US, EU)
- Carbon Footprint: Estimated ~0.01 kg CO2 per minute of video processed
Citation
If you use this model in your research or application, please cite:
@misc{video-text-remover-2025,
author = {Helder Lima},
title = {Video Text Remover: AI-Powered Text Overlay Removal},
year = {2025},
publisher = {Replicate},
howpublished = {\url{https://replicate.com/hjunior29/video-text-remover}}
}
Model Card Authors
Helder Lima
Model Card Contact
For questions, issues, or feedback:
- GitHub: hjunior29/video-text-remover
- Replicate: hjunior29/video-text-remover
License
MIT License - See LICENSE file for details
Acknowledgments
- YOLO: Ultralytics for the YOLOv8 architecture
- ONNX Runtime: Microsoft for efficient model inference
- OpenCV: For image processing and inpainting algorithms
- Replicate: For model hosting and deployment infrastructure