Abstract

Phishing attacks represent one of the most significant and persistent threats in the cybersecurity landscape, with attackers increasingly using sophisticated URL manipulation techniques to deceive users and steal sensitive information. Traditional detection methods, which rely primarily on blacklists and heuristic rules, struggle to identify zero-day phishing URLs that have not yet been catalogued in security databases. This research addresses this gap by developing an interpretable machine learning framework for detecting phishing URLs using lexical, structural, content-based, and domain metadata features. The study employs a dataset of 11,430 labeled URLs (5,715 legitimate and 5,715 phishing) with 87 extracted features, categorized into structural (49 features), content-based (15 features), and metadata (23 features) attributes. Through exploratory data analysis, the research identifies key discriminative patterns between legitimate and phishing URLs, including URL length, domain age, page rank, and Google indexing status. Four machine learning algorithms were trained, optimized, and evaluated: Decision Tree, Random Forest, Support Vector Machine (SVM), and LightGBM. The best-performing model, LightGBM with hyperparameter tuning, achieved an accuracy of 96.94%, an F1-score of 0.9694, and a ROC-AUC of 0.9942. Principal Component Analysis (PCA) was applied to reduce dimensionality from 87 features to 29 principal components while retaining 95% of the variance, addressing multicollinearity and improving computational efficiency. To enhance model transparency and user trust, SHAP (SHapley Additive Explanations) values were computed to provide interpretable explanations for model predictions. The analysis revealed that domain reputation metrics, particularly Google index status and page rank, are the most influential features in distinguishing phishing from legitimate URLs.
A web-based interface built with Gradio was developed to enable real-time URL classification with explainable predictions. The research demonstrates that ensemble methods, particularly gradient boosting algorithms such as LightGBM, outperform traditional classifiers for phishing detection. The integration of explainable AI techniques provides actionable insights into model decision-making, addressing the "black box" problem common in machine learning security applications. This work contributes to the development of practical, scalable, and interpretable phishing detection systems suitable for real-time deployment in resource-constrained environments.
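
As a rough illustration of what lexical and structural URL features can look like, the sketch below derives a handful of them from a URL string using only the Python standard library. It is not the thesis's feature extractor: the function name `lexical_features`, the feature names, and the example URL are assumptions chosen for illustration, and the abstract does not list the 87 features actually used.

```python
from urllib.parse import urlsplit


def lexical_features(url: str) -> dict:
    """Compute a few illustrative lexical/structural features of a URL.

    Hypothetical feature set for illustration only; it is not the
    87-feature schema described in the abstract.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    return {
        "url_length": len(url),
        "hostname_length": len(host),
        "num_dots": url.count("."),
        "num_hyphens": url.count("-"),
        "num_digits": sum(ch.isdigit() for ch in url),
        "has_at_symbol": "@" in url,
        "uses_https": parts.scheme == "https",
        # Naive estimate: dots in the hostname minus one. It ignores
        # multi-label public suffixes such as "co.uk".
        "num_subdomains": max(host.count(".") - 1, 0),
    }


features = lexical_features(
    "http://secure-login.example.com/account/verify?id=123"
)
for name, value in features.items():
    print(f"{name}: {value}")
```

Features like these need no network access, unlike the domain metadata attributes the abstract mentions (domain age, page rank, Google indexing status), which would have to be looked up from external services.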

Library of Congress Subject Headings

Phishing--Prevention--Automation; Natural language processing (Computer science); Machine learning; Computer algorithms--Evaluation

Publication Date

12-2025

Document Type

Thesis

Student Type

Graduate

Degree Name

Professional Studies (MS)

Department, Program, or Center

Graduate Programs & Research

Advisor

Ioannis Karamitsos

Advisor/Committee Member

Sanjay Modak

Recommended Citation

Alrokhaimi, Khalid, "An Interpretable Machine Learning Framework for Detecting Phishing URLs Based on Lexical Features" (2025). Thesis. Rochester Institute of Technology. Accessed from
https://repository.rit.edu/theses/12410

Campus

RIT Dubai

Plan Codes

PROFST-MS
