Abstract

Phishing attacks represent one of the most significant and persistent threats in the cybersecurity landscape, with attackers increasingly using sophisticated URL manipulation techniques to deceive users and steal sensitive information. Traditional detection methods, which rely primarily on blacklists and heuristic rules, struggle to identify zero-day phishing URLs that have not yet been catalogued in security databases. This research addresses this gap by developing an interpretable machine learning framework for detecting phishing URLs using lexical, structural, content-based, and domain metadata features. The study employs a balanced dataset of 11,430 labeled URLs (5,715 legitimate and 5,715 phishing) with 87 extracted features, categorized into structural (49 features), content-based (15 features), and metadata (23 features) attributes. Through exploratory data analysis, the research identifies key discriminative patterns between legitimate and phishing URLs, including URL length, domain age, page rank, and Google indexing status. Four machine learning algorithms were trained, optimized, and evaluated: Decision Tree, Random Forest, Support Vector Machine (SVM), and LightGBM. The best-performing model, LightGBM with hyperparameter tuning, achieved an accuracy of 96.94%, an F1-score of 0.9694, and a ROC-AUC of 0.9942. Principal Component Analysis (PCA) was applied to reduce dimensionality from 87 features to 29 principal components while retaining 95% of the variance, addressing multicollinearity and improving computational efficiency. To enhance model transparency and user trust, SHAP (SHapley Additive exPlanations) values were computed to provide interpretable explanations for model predictions. The analysis revealed that domain reputation metrics, particularly Google index status and page rank, are the most influential features in distinguishing phishing from legitimate URLs.
A web-based interface built with Gradio was developed to enable real-time URL classification with explainable predictions. The research demonstrates that ensemble methods, particularly gradient boosting algorithms such as LightGBM, outperform traditional classifiers for phishing detection. The integration of explainable AI techniques provides actionable insights into model decision-making, addressing the “black box” problem common in machine learning security applications. This work contributes to the development of practical, scalable, and interpretable phishing detection systems suitable for real-time deployment in resource-constrained environments.
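As a rough illustration of what lexical and structural URL features can look like, the following minimal sketch computes a handful of them using only the Python standard library. The function name `lexical_features`, the feature names, and the example URL are assumptions made for illustration; they are not the 87 features used in the thesis, and the subdomain count is a deliberately crude heuristic.

```python
from urllib.parse import urlparse


def lexical_features(url: str) -> dict:
    """Compute a few illustrative lexical/structural URL features.

    Hypothetical feature set for illustration only; not the thesis's features.
    """
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "hostname_length": len(host),
        "num_dots": url.count("."),
        "num_hyphens": url.count("-"),
        "num_digits": sum(ch.isdigit() for ch in url),
        "has_at_symbol": "@" in url,
        "uses_https": parsed.scheme == "https",
        # Crude estimate: dots in the hostname minus one.
        # Ignores multi-part suffixes such as "co.uk".
        "num_subdomains": max(host.count(".") - 1, 0),
    }


# Example URL is made up for demonstration purposes.
features = lexical_features(
    "http://secure-login.example.com/account/verify?id=123"
)
for name, value in features.items():
    print(f"{name}: {value}")
```

Features of this kind would typically be assembled into a numeric table, one row per URL, before being passed to a classifier; content-based and domain metadata features require fetching the page or querying external services and are not sketched here.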

Publication Date

12-2025

Document Type

Thesis

Student Type

Graduate

Degree Name

Professional Studies (MS)

Department, Program, or Center

Graduate Programs & Research

Advisor

Sanjay Modak

Advisor/Committee Member

Ioannis Karamitsos

Campus

RIT Dubai
