Abstract
This research examines an approach to spam email detection based on machine learning and natural language processing (NLP). The study is set against the background of rising cyber threats and growing email volume, and aims to classify messages correctly as spam or ham by combining Logistic Regression with NLP preprocessing (tokenization, lemmatization, and TF-IDF vectorization). The research questions address the effectiveness of this approach and the interpretation of its results. The data come from a publicly available Kaggle dataset containing 5,572 labeled email messages. A quantitative approach was applied, consisting of model training, evaluation, and visualization. The model achieved high accuracy, precision, and ROC-AUC scores, demonstrating its effectiveness. The conclusions show that, with proper preprocessing, Logistic Regression offers a low-cost yet strongly performing method for spam detection. Recommendations for further study include exploring ensemble methods, addressing class imbalance, and investigating deep learning models for real-time implementation.
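The pipeline summarized in the abstract (tokenization, TF-IDF vectorization, Logistic Regression) can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names, hyperparameters, and the six toy messages are all assumptions made for illustration, lemmatization is omitted for brevity, and the Kaggle dataset is not used. It relies only on the Python standard library.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split into alphabetic tokens (lemmatization omitted)."""
    return re.findall(r"[a-z']+", text.lower())

def fit_tfidf(docs):
    """Learn a vocabulary and smoothed IDF weights from tokenized documents."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    vocab = {tok: i for i, tok in enumerate(sorted(df))}
    idf = {tok: math.log((1 + n) / (1 + df[tok])) + 1.0 for tok in vocab}
    return vocab, idf

def transform(doc, vocab, idf):
    """Map one tokenized document to an L2-normalized TF-IDF vector."""
    vec = [0.0] * len(vocab)
    for tok, count in Counter(doc).items():
        if tok in vocab:  # tokens unseen during fitting are ignored
            vec[vocab[tok]] = count * idf[tok]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X, y, lr=0.5, epochs=500):
    """Fit logistic regression weights by batch gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

# Toy, made-up messages for illustration only (1 = spam, 0 = ham).
messages = [
    ("win a free prize now", 1),
    ("free cash claim your prize", 1),
    ("urgent claim free reward now", 1),
    ("are we still meeting for lunch", 0),
    ("see you at the office tomorrow", 0),
    ("lunch tomorrow at noon", 0),
]
docs = [tokenize(text) for text, _ in messages]
labels = [label for _, label in messages]
vocab, idf = fit_tfidf(docs)
X = [transform(doc, vocab, idf) for doc in docs]
w, b = train_logreg(X, labels)

def spam_probability(text):
    """Estimated probability that a new message is spam."""
    x = transform(tokenize(text), vocab, idf)
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
```

Calling `spam_probability("free prize now")` returns a value above 0.5 on this toy data, while `spam_probability("lunch at the office")` returns a value below 0.5. A study at the scale described would more plausibly use an established library for vectorization, training, and ROC-AUC evaluation on a held-out test split.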
Publication Date
12-2025
Document Type
Thesis
Student Type
Graduate
Degree Name
Professional Studies (MS)
Department, Program, or Center
Graduate Programs & Research
Advisor
Sanjay Modak
Advisor/Committee Member
Ioannis Karamitsos
Recommended Citation
Almarri, Rashed, "Comparative Analysis of Machine Learning Models for Spam Email Detection" (2025). Thesis. Rochester Institute of Technology. Accessed from
https://repository.rit.edu/theses/12395
Campus
RIT Dubai
