zhixindev/nsfw-image-detection-2026

Fine-Tuned Vision Transformer (ViT) for NSFW Image Classification (2026)

About

Cog implementation of Falconsai/nsfw_image_detection_2026. This is the enhanced, enterprise-ready version of the widely used NSFW image classification model.

Model Card

Fine-Tuned Vision Transformer (ViT) V2 for High-Precision NSFW Image Classification (2026 Edition).

Model Description

The Fine-Tuned Vision Transformer (ViT) V2 is an advanced iteration of the transformer encoder architecture, adapted for high-precision content moderation. This 2026 edition is built upon the "google/vit-base-patch16-224-in21k" baseline and has been retrained on a much larger dataset to improve classification accuracy over the original release (see Performance Comparison below).

The 2026 training run emphasized hyperparameter optimization, using a dynamic learning rate scheduler and an effective batch size of 64. These choices allow the model to process complex visual contexts more effectively than its predecessor.
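The stated recipe can be sketched as a training configuration. Note that only the base checkpoint, the effective batch size of 64, and the use of a dynamic learning rate scheduler come from this card; every other value below is an illustrative placeholder, not the published recipe:

```python
# Hedged sketch of the training setup. Values marked "assumed" are
# illustrative assumptions, not published hyperparameters.
train_config = {
    "base_model": "google/vit-base-patch16-224-in21k",
    "per_device_train_batch_size": 16,            # assumed split
    "gradient_accumulation_steps": 4,             # 16 * 4 = effective batch of 64
    "lr_scheduler_type": "cosine_with_restarts",  # one possible "dynamic" scheduler
    "learning_rate": 5e-5,                        # assumed, not published
    "num_labels": 2,                              # "normal" vs. "nsfw"
}

# The effective batch size is the per-device batch times the number of
# gradient accumulation steps.
effective_batch = (
    train_config["per_device_train_batch_size"]
    * train_config["gradient_accumulation_steps"]
)
```

Splitting the effective batch across accumulation steps is a common way to reach a batch size of 64 on memory-constrained GPUs; the exact split used here is not documented.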

The most significant upgrade is the expanded dataset. Moving beyond the legacy 80,000-image corpus, this model was trained on a meticulously curated proprietary dataset of over 1.25 million images. This massive increase in data variability allows the model to capture highly nuanced patterns and significantly reduce false positives in borderline cases like classical art or medical imagery.

The result is a robust, enterprise-ready model that sets a new benchmark for automated content safety and trust-and-safety compliance.

Performance Comparison

| Metric | Legacy Version (80k) | 2026 Version (1.2M) | Improvement |
|---|---|---|---|
| Evaluation Accuracy | 98.03% | 99.71% | +1.68 percentage points |
| Evaluation Loss | 0.0746 | 0.0124 | ≈6× lower |
| Samples per Second | 52.46 | 86.15 | +64% throughput |

Intended Uses & Limitations

NSFW Image Classification: The primary intended use of this model is for the real-time classification and filtering of NSFW (Not Safe for Work) images. It has been fine-tuned for this purpose, making it suitable for filtering explicit or inappropriate content in various applications.

The model returns one of two labels: "normal" or "nsfw".

Note: This model assumes input images are RGB. Convert images to RGB before inference to avoid degraded accuracy on grayscale, palette, or alpha-channel inputs.
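A minimal preprocessing helper for the RGB requirement, with a hedged inference sketch. The use of the standard transformers image-classification pipeline is an assumption based on the description above, and the checkpoint name is taken from this card without confirming its availability on the Hub:

```python
from PIL import Image


def ensure_rgb(image: Image.Image) -> Image.Image:
    """Convert any PIL image (RGBA, L, P, CMYK, ...) to RGB, as the model expects."""
    return image if image.mode == "RGB" else image.convert("RGB")


def classify(image_path: str) -> str:
    """Hedged sketch: assumes a standard transformers image-classification
    pipeline; the checkpoint name comes from this card and is not verified."""
    from transformers import pipeline  # lazy import: only needed at inference time

    classifier = pipeline(
        "image-classification",
        model="Falconsai/nsfw_image_detection_2026",  # assumed checkpoint id
    )
    image = ensure_rgb(Image.open(image_path))
    top = classifier(image)[0]  # highest-scoring label first
    return top["label"]         # either "normal" or "nsfw"
```

Converting before inference (rather than relying on the model's own preprocessing) guarantees the RGB assumption holds regardless of the input file format.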

Ethical Considerations

It is essential to use this model responsibly and ethically, adhering to content guidelines and applicable regulations. While heavily optimized, the definition of NSFW can vary culturally. Users should calibrate confidence thresholds based on their specific community guidelines.
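Threshold calibration, as suggested above, can be sketched as a post-processing step over the model's label scores. The score format mirrors a typical classification-pipeline output, and the default threshold is an arbitrary example, not a recommendation:

```python
def apply_threshold(scores: dict[str, float], nsfw_threshold: float = 0.8) -> str:
    """Return "nsfw" only when its score clears the configured threshold.

    scores: label -> probability, e.g. {"normal": 0.35, "nsfw": 0.65}.
    A stricter (higher) threshold trades recall for fewer false positives
    on borderline content such as classical art or medical imagery.
    """
    return "nsfw" if scores.get("nsfw", 0.0) >= nsfw_threshold else "normal"
```

Communities with different guidelines can tune `nsfw_threshold` against a labeled sample of their own content rather than using the model's raw argmax.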
