PhishViT: A Vision Transformer-Based Framework for Real-Time Phishing Detection from Webpage Screenshots

Authors

Jean Chrysostome NDAYISABYE

Abstract

Phishing attacks are among the most pervasive and economically damaging cyber threats, and conventional detection systems rely primarily on URL lexical analysis, DNS inspection, and HTML source-code heuristics. These text-centric approaches share a fundamental blind spot: they do not examine the visual rendering of webpages as perceived by human users, leaving a detection gap that visual-layer spoofing attacks exploit. This paper presents PhishViT, a Vision Transformer-based framework for real-time phishing detection that operates exclusively on webpage screenshots. Unlike methods that analyze URL strings or page source code, PhishViT learns discriminative visual representations directly from rendered webpage images using a fine-tuned Data-efficient Image Transformer (DeiT-Small) architecture. An automated Playwright browser pipeline captures live screenshots, which are classified as phishing or legitimate, and an interpretable attention rollout heatmap is generated for each prediction. The framework is developed through an iterative three-phase process: an initial prototype (V1: 253 screenshots, 78.95% accuracy), an expanded balanced dataset (V2: 642 screenshots, 96.91% accuracy), and a rigorous evaluation framework (V3). V3 comprises a baseline comparison against ResNet50, EfficientNet-B0, and ViT-Base; 5-fold cross-validation yielding 85.23% ± 1.18% accuracy; an ablation study validating each design choice; and a robustness evaluation under six visual perturbation conditions. The V3 evaluation shows that DeiT-Small achieves 91.75% accuracy, 93.48% precision, 89.58% recall, 91.49% F1-score, and an AUC-ROC of 0.9928 at only 5.44 ms inference latency, outperforming EfficientNet-B0 and ViT-Base while achieving the best efficiency-accuracy trade-off. These results establish PhishViT as a viable, interpretable, and computationally efficient visual-layer phishing detection framework suitable for real-time browser extension deployment.
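The abstract mentions attention rollout heatmaps without describing how they are computed. The sketch below illustrates the general attention rollout technique in NumPy; it is not taken from the PhishViT implementation. It assumes per-layer attention arrays of shape (heads, tokens, tokens) with softmax-normalized rows, head fusion by averaging, and a token layout of one class token followed by a 14 x 14 patch grid (197 tokens, as for a 224 x 224 input with 16 x 16 patches). The function names `attention_rollout` and `cls_heatmap` are hypothetical.

```python
import numpy as np


def attention_rollout(attentions):
    """Combine per-layer attention maps into one token-to-token relevance matrix.

    attentions: list of arrays, one per transformer layer, each shaped
    (heads, tokens, tokens) with rows that sum to 1 (assumed layout).
    """
    num_tokens = attentions[0].shape[-1]
    rollout = np.eye(num_tokens)
    for layer_attn in attentions:
        fused = layer_attn.mean(axis=0)            # average over heads (one common choice)
        fused = fused + np.eye(num_tokens)         # account for the residual connection
        fused = fused / fused.sum(axis=-1, keepdims=True)  # re-normalize each row
        rollout = fused @ rollout                  # propagate through this layer
    return rollout


def cls_heatmap(rollout, grid_size):
    """Turn the class-token row of the rollout into a patch-grid heatmap in [0, 1]."""
    cls_to_patches = rollout[0, 1:]                # skip the class token itself
    heat = cls_to_patches.reshape(grid_size, grid_size)
    return heat / heat.max()


if __name__ == "__main__":
    # Random stand-in attention: 12 layers, 6 heads, 197 tokens (assumed sizes).
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(12, 6, 197, 197))
    attn = np.exp(logits)
    attn /= attn.sum(axis=-1, keepdims=True)
    heat = cls_heatmap(attention_rollout(list(attn)), grid_size=14)
    print(heat.shape)
```

In practice the heatmap would be upsampled to the screenshot resolution and overlaid on the image; the attention arrays would come from the fine-tuned model rather than random values.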

Creative Commons license

Except where otherwise noted, this item's license is described as Attribution 3.0 United States