PhishViT: A Vision Transformer-Based Framework for Real-Time Phishing Detection from Webpage Screenshots
Authors
Jean chrysostome NDAYISABYE
Abstract
Phishing attacks represent one of the most pervasive and economically devastating cyber threats, yet conventional detection systems rely primarily on URL lexical analysis, DNS inspection, and HTML source-code heuristics. These text-centric approaches share a fundamental blind spot: they do not examine the visual rendering of webpages as perceived by human users, leaving a critical detection gap that visual-layer spoofing attacks exploit. This paper presents PhishViT, a Vision Transformer-based framework for real-time phishing detection that operates exclusively on webpage screenshots. Unlike methods that analyze URL strings or page source code, PhishViT learns discriminative visual representations directly from rendered webpage images using a fine-tuned Data-efficient Image Transformer (DeiT-Small) architecture. An automated Playwright browser pipeline captures live screenshots, which are then classified as phishing or legitimate, and an interpretable attention rollout heatmap is generated for each prediction. The framework is developed through an iterative three-phase process: an initial prototype (V1: 253 screenshots, 78.95% accuracy), an expanded, balanced dataset (V2: 642 screenshots, 96.91% accuracy), and a rigorous evaluation framework (V3). V3 comprises a comprehensive baseline comparison against ResNet50, EfficientNet-B0, and ViT-Base; 5-fold cross-validation confirming 85.23% ± 1.18% accuracy; an ablation study validating each design choice; and a robustness evaluation under six visual perturbation conditions. The V3 evaluation demonstrates that DeiT-Small achieves 91.75% accuracy, 93.48% precision, 89.58% recall, 91.49% F1-score, and an AUC-ROC of 0.9928 at only 5.44 ms inference latency, outperforming EfficientNet-B0 and ViT-Base while achieving the best efficiency-accuracy trade-off. These results establish PhishViT as a viable, interpretable, and computationally efficient visual-layer phishing detection framework suitable for real-time browser extension deployment.
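The reported V3 F1-score can be cross-checked against the reported precision and recall, since F1 is their harmonic mean. The short sketch below does only that arithmetic; the helper name `f1_score` is illustrative and is not taken from the paper's code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0  # convention when there are no true positives
    return 2 * precision * recall / (precision + recall)


# V3 DeiT-Small figures as reported in the abstract.
precision = 0.9348
recall = 0.8958

f1 = f1_score(precision, recall)
print(f"F1 = {f1:.4f}")  # F1 = 0.9149
```

The result agrees with the 91.49% F1-score stated in the abstract.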
Creative Commons license
Except where otherwise noted, this item's license is described as Attribution 3.0 United States
