Abstract
Phishing attacks represent one of the most significant and persistent threats in the cybersecurity landscape, with attackers increasingly using sophisticated URL manipulation techniques to deceive users and steal sensitive information. Traditional detection methods, which rely primarily on blacklists and heuristic rules, struggle to identify zero-day phishing URLs that have not yet been catalogued in security databases. This research addresses this gap by developing an interpretable machine learning framework for detecting phishing URLs using lexical, structural, content-based, and domain metadata features. The study employs a balanced dataset of 11,430 labeled URLs (5,715 legitimate and 5,715 phishing) with 87 extracted features, categorized into structural (49 features), content-based (15 features), and metadata (23 features) attributes. Through exploratory data analysis, the research identifies key discriminative patterns between legitimate and phishing URLs, including URL length, domain age, page rank, and Google indexing status. Four machine learning algorithms were trained, optimized, and evaluated: Decision Tree, Random Forest, Support Vector Machine (SVM), and LightGBM. The best-performing model, LightGBM with hyperparameter tuning, achieved an accuracy of 96.94%, an F1-score of 0.9694, and a ROC-AUC of 0.9942. Principal Component Analysis (PCA) was applied to reduce dimensionality from 87 features to 29 principal components while retaining 95% of the variance, addressing multicollinearity and improving computational efficiency. To enhance model transparency and user trust, SHAP (SHapley Additive exPlanations) values were computed to explain model predictions. The analysis revealed that domain reputation metrics, particularly google index and page rank, are the most influential features in distinguishing phishing from legitimate URLs.
A web-based interface using Gradio was developed to enable real-time URL classification with explainable predictions. The research demonstrates that ensemble methods, particularly gradient boosting algorithms like LightGBM, outperform traditional classifiers for phishing detection. The integration of explainable AI techniques provides actionable insights into model decision-making, addressing the “black box” problem common in machine learning security applications. This work contributes to the development of practical, scalable, and interpretable phishing detection systems suitable for real-time deployment in resource-constrained environments.
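To make the notion of lexical and structural URL features concrete, the following is a minimal sketch of how a few such features might be computed with the Python standard library. It is not the thesis's feature extractor: the function name `lexical_features`, the feature names, and the example URL are all hypothetical and chosen for illustration, and the actual 87-feature set is not reproduced here.

```python
from urllib.parse import urlparse


def lexical_features(url: str) -> dict:
    """Compute a few illustrative lexical/structural features of a URL.

    Hypothetical feature names; this does not mirror the thesis's feature set.
    """
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),                      # total characters in the URL
        "hostname_length": len(host),                # characters in the host part
        "num_dots": url.count("."),                  # dots anywhere in the URL
        "num_hyphens": url.count("-"),               # hyphens anywhere in the URL
        "num_digits": sum(ch.isdigit() for ch in url),
        "has_at_symbol": "@" in url,                 # '@' can hide the real host
        "uses_https": parsed.scheme == "https",
        # Naive estimate: labels beyond "name.tld"; ignores multi-part
        # suffixes such as "co.uk".
        "num_subdomains": max(host.count(".") - 1, 0),
    }


# Example with a made-up, suspicious-looking URL.
features = lexical_features(
    "http://secure-login.example.com.verify-account.test/update?id=123"
)
for name, value in features.items():
    print(f"{name}: {value}")
```

Features of this kind would form only part of a classifier's input; the content-based and domain metadata features described in the abstract require fetching the page or querying external services and are outside the scope of this sketch.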
Publication Date
12-2025
Document Type
Thesis
Student Type
Graduate
Degree Name
Professional Studies (MS)
Department, Program, or Center
Graduate Programs & Research
Advisor
Sanjay Modak
Advisor/Committee Member
Ioannis Karamitsos
Recommended Citation
Alrokhaimi, Khalid, "An Interpretable Machine Learning Framework for Detecting Phishing URLs Based on Lexical Features" (2025). Thesis. Rochester Institute of Technology. Accessed from
https://repository.rit.edu/theses/12410
Campus
RIT Dubai
