Real-Time Inappropriate Content Detection on YouTube Using CLIP: A Zero-Shot Vision-Language Approach
The rapid expansion of online video platforms has significantly increased children’s exposure to potentially harmful content, including violent and explicit material. Traditional moderation techniques, such as keyword-based filtering and static blocklists, are insufficient to address the dynamic and multimodal nature of modern digital media. This study proposes a real-time content moderation system that integrates a browser extension with a guardian monitoring platform, enabling continuous supervision of YouTube video consumption.
The system leverages the CLIP (Contrastive Language–Image Pre-training) model to perform zero-shot classification of video frames by aligning visual and textual representations in a shared semantic space: each sampled frame is scored against a set of predefined harmful and safe text labels, and the most similar label determines the frame's class. The methodology involves periodic frame sampling, preprocessing, and similarity-based classification. A dual-pass decision mechanism, combined with temporal consistency filtering across consecutive frames, is employed to improve detection reliability and reduce false positives.
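The similarity-based classification and temporal filtering steps can be sketched as follows. This is a minimal illustration, not the paper's implementation: the CLIP image and text embeddings are simulated with random vectors (in practice they would come from a pretrained CLIP encoder), and the label names, window size, and vote threshold are assumed values for demonstration.

```python
import numpy as np

# Illustrative label sets; the paper's actual label prompts are not specified here.
HARMFUL_LABELS = ["violence", "weapons", "explicit content"]
SAFE_LABELS = ["cartoon", "education", "music video"]
LABELS = HARMFUL_LABELS + SAFE_LABELS


def normalize(v):
    """L2-normalize embeddings so dot products equal cosine similarities."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)


def classify_frame(frame_emb, label_embs):
    """Zero-shot classification in the CLIP style: compare one frame
    embedding against all label text embeddings via cosine similarity
    and return the best-matching label."""
    sims = normalize(frame_emb) @ normalize(label_embs).T
    return LABELS[int(np.argmax(sims))]


def temporally_filtered_verdict(frame_verdicts, window=5, threshold=3):
    """Temporal consistency filter: flag a segment as harmful only if at
    least `threshold` of the last `window` per-frame verdicts fall under
    a harmful label, suppressing one-off false positives."""
    recent = frame_verdicts[-window:]
    harmful_hits = sum(v in HARMFUL_LABELS for v in recent)
    return harmful_hits >= threshold


# Stand-in embeddings (random unit vectors in place of real CLIP outputs).
rng = np.random.default_rng(0)
label_embs = rng.normal(size=(len(LABELS), 512))
frame_embs = rng.normal(size=(8, 512))  # 8 periodically sampled frames

verdicts = [classify_frame(f, label_embs) for f in frame_embs]
print(temporally_filtered_verdict(verdicts))
```

The temporal filter is what converts noisy per-frame predictions into a stable segment-level decision; raising `threshold` trades recall for precision.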
Experimental evaluation on a labeled dataset shows that the proposed system achieves an accuracy of 78%, with high recall for harmful content. The results indicate that the system effectively prioritizes safety by minimizing undetected harmful content while maintaining acceptable precision.
Overall, the proposed approach highlights the practical potential of zero-shot learning for real-time content moderation in dynamic environments. The system provides an effective, scalable, and privacy-aware solution for enhancing child safety in online video platforms.



