PhishViT: A Vision Transformer-Based Framework for Real-Time Phishing Detection from Webpage Screenshots
| dc.contributor.author | Jean Chrysostome NDAYISABYE | |
| dc.date.accessioned | 2026-04-13T08:01:19Z | |
| dc.date.issued | 2026 | |
| dc.description.abstract | Abstract: Phishing attacks are among the most pervasive and economically damaging cyber threats. Conventional detection systems rely primarily on URL lexical analysis, DNS inspection, and HTML source-code heuristics. These text-centric approaches share a fundamental blind spot: they do not examine the visual rendering of webpages as perceived by human users, which leaves a detection gap that visual-layer spoofing attacks exploit. This paper presents PhishViT, a Vision Transformer-based framework for real-time phishing detection that operates exclusively on webpage screenshots. Unlike methods that analyze URL strings or page source code, PhishViT learns discriminative visual representations directly from rendered webpage images using a fine-tuned Data-efficient Image Transformer (DeiT-Small) architecture. An automated Playwright browser pipeline captures live screenshots, which are classified as phishing or legitimate, and an interpretable attention rollout heatmap is generated for each prediction. The framework is developed through an iterative three-phase process: an initial prototype (V1: 253 screenshots, 78.95% accuracy), an expanded, balanced dataset (V2: 642 screenshots, 96.91% accuracy), and a rigorous evaluation framework (V3) comprising a comprehensive baseline comparison against ResNet50, EfficientNet-B0, and ViT-Base; 5-fold cross-validation confirming 85.23%±1.18% accuracy; an ablation study validating each design choice; and a robustness evaluation under six visual perturbation conditions. The V3 evaluation shows that DeiT-Small achieves 91.75% accuracy, 93.48% precision, 89.58% recall, 91.49% F1-score, and an AUC-ROC of 0.9928 at only 5.44 ms inference latency, outperforming EfficientNet-B0 and ViT-Base while achieving the best efficiency-accuracy trade-off. These results establish PhishViT as a viable, interpretable, and computationally efficient visual-layer phishing detection framework suitable for real-time browser extension deployment. | |
| dc.description.provenance | Submitted by Jean Chrysostome NDAYISABYE (ndayisabyejeanchrysostome@gmail.com) on 2026-04-10T17:11:29Z workflow start=Step: reviewstep - action:claimaction No. of bitstreams: 2 PhishViT.pdf: 969931 bytes, checksum: 0164c492f89d1713b81479f201d320c9 (MD5) license_rdf: 1025 bytes, checksum: 5fbab3a8de1b8b11fce4c9bca21b0aab (MD5) | en |
| dc.description.provenance | Step: reviewstep - action:reviewaction Approved for entry into archive by Jo Havemann (jo@africarxiv.org) on 2026-04-13T08:01:19Z (GMT) | en |
| dc.description.provenance | Made available in DSpace on 2026-04-13T08:01:19Z (GMT). No. of bitstreams: 2 PhishViT.pdf: 969931 bytes, checksum: 0164c492f89d1713b81479f201d320c9 (MD5) license_rdf: 1025 bytes, checksum: 5fbab3a8de1b8b11fce4c9bca21b0aab (MD5) Previous issue date: 2026 | en |
| dc.identifier.uri | https://africarxiv.ubuntunet.net/handle/1/11307 | |
| dc.language.iso | en | |
| dc.rights | Attribution 3.0 United States | en |
| dc.rights.uri | http://creativecommons.org/licenses/by/3.0/us/ | |
| dc.subject | Phishing Detection | |
| dc.subject | Vision Transformer | |
| dc.subject | DeiT-Small | |
| dc.subject | Screenshot Classification | |
| dc.subject | Cybersecurity | |
| dc.subject | Deep Learning | |
| dc.subject | Attention Rollout | |
| dc.subject | Cross-Validation | |
| dc.subject | Robustness Evaluation | |
| dc.title | PhishViT: A Vision Transformer-Based Framework for Real-Time Phishing Detection from Webpage Screenshots | |
| dc.type | Working Paper | |
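
As a quick consistency check on the V3 figures quoted in the abstract, the F1-score can be recomputed from the stated precision and recall using the standard harmonic-mean definition. The sketch below is illustrative only: the function name `f1_from_precision_recall` is hypothetical and is not taken from the PhishViT code, and the inputs are the rounded percentages reported in the abstract.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Return F1, the harmonic mean of precision and recall."""
    if precision + recall == 0:
        # Avoid division by zero when both metrics are zero.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded values quoted in the abstract for DeiT-Small (V3 evaluation).
precision = 0.9348
recall = 0.8958

f1 = f1_from_precision_recall(precision, recall)
print(f"F1 = {f1 * 100:.2f}%")  # prints "F1 = 91.49%"
```

The recomputed value agrees with the 91.49% F1-score reported in the abstract, so the three figures are mutually consistent at the stated precision.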