PhishViT: A Vision Transformer-Based Framework for Real-Time Phishing Detection from Webpage Screenshots

dc.contributor.author: Jean Chrysostome NDAYISABYE
dc.date.accessioned: 2026-04-13T08:01:19Z
dc.date.issued: 2026
dc.description.abstract: Phishing attacks represent one of the most pervasive and economically devastating cyber threats, with conventional detection systems relying primarily on URL lexical analysis, DNS inspection, and HTML source-code heuristics. These text-centric approaches share a fundamental blind spot: they do not examine the visual rendering of webpages as perceived by human users, leaving a critical detection gap exploited by visual-layer spoofing attacks. This paper presents PhishViT, a Vision Transformer-based framework for real-time phishing detection that operates exclusively on webpage screenshots. Unlike methods that analyze URL strings or page source code, PhishViT learns discriminative visual representations directly from rendered webpage images using a fine-tuned Data-efficient Image Transformer (DeiT-Small) architecture. An automated Playwright browser pipeline captures live screenshots, which are classified as phishing or legitimate, with interpretable attention rollout heatmaps generated for each prediction. The framework is developed through an iterative three-phase process, starting from an initial prototype (V1: 253 screenshots, 78.95% accuracy), expanding to a balanced dataset (V2: 642 screenshots, 96.91% accuracy), and culminating in a rigorous top-tier evaluation framework (V3) with comprehensive baseline comparison against ResNet50, EfficientNet-B0, and ViT-Base; 5-fold cross-validation confirming 85.23% ± 1.18% accuracy; an ablation study validating each design choice; and robustness evaluation under six visual perturbation conditions. The V3 evaluation demonstrates that DeiT-Small achieves 91.75% accuracy, 93.48% precision, 89.58% recall, 91.49% F1-score, and an AUC-ROC of 0.9928 at only 5.44 ms inference latency, outperforming EfficientNet-B0 and ViT-Base while achieving the best efficiency-accuracy trade-off. These results establish PhishViT as a viable, interpretable, and computationally efficient visual-layer phishing detection framework suitable for real-time browser extension deployment.
dc.description.provenance: Submitted by Jean Chrysostome NDAYISABYE (ndayisabyejeanchrysostome@gmail.com) on 2026-04-10T17:11:29Z workflow start=Step: reviewstep - action:claimaction No. of bitstreams: 2 PhishViT.pdf: 969931 bytes, checksum: 0164c492f89d1713b81479f201d320c9 (MD5) license_rdf: 1025 bytes, checksum: 5fbab3a8de1b8b11fce4c9bca21b0aab (MD5) [en]
dc.description.provenance: Step: reviewstep - action:reviewaction Approved for entry into archive by Jo Havemann (jo@africarxiv.org) on 2026-04-13T08:01:19Z (GMT) [en]
dc.description.provenance: Made available in DSpace on 2026-04-13T08:01:19Z (GMT). No. of bitstreams: 2 PhishViT.pdf: 969931 bytes, checksum: 0164c492f89d1713b81479f201d320c9 (MD5) license_rdf: 1025 bytes, checksum: 5fbab3a8de1b8b11fce4c9bca21b0aab (MD5) Previous issue date: 2026 [en]
dc.identifier.uri: https://africarxiv.ubuntunet.net/handle/1/11307
dc.language.iso: en
dc.rights: Attribution 3.0 United States [en]
dc.rights.uri: http://creativecommons.org/licenses/by/3.0/us/
dc.subject: Phishing Detection
dc.subject: Vision Transformer
dc.subject: DeiT-Small
dc.subject: Screenshot Classification
dc.subject: Cybersecurity
dc.subject: Deep Learning
dc.subject: Attention Rollout
dc.subject: Cross-Validation
dc.subject: Robustness Evaluation
dc.title: PhishViT: A Vision Transformer-Based Framework for Real-Time Phishing Detection from Webpage Screenshots
dc.type: Working Paper

Files

Original bundle

Name: PhishViT.pdf
Size: 947.2 KB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 2.22 KB
Format: Item-specific license agreed to upon submission