Abstract

This research examines an effective approach to spam email detection based on machine learning and natural language processing (NLP). The study is set against the background of rising cyber threats and growing email volume, where messages must be classified correctly as spam or ham. The approach combines Logistic Regression with NLP preprocessing, including tokenization, lemmatization, and TF-IDF vectorization. The research questions address the effectiveness of this approach and the interpretation of its results. The data come from a publicly available Kaggle data set containing 5,572 labeled email messages. A quantitative approach was applied, comprising model training, evaluation, and visualization. The model achieved high accuracy, precision, and ROC-AUC, demonstrating its effectiveness. The findings indicate that, with proper preprocessing, Logistic Regression provides a low-cost yet strong-performing method for spam detection. Recommendations for further study include evaluating ensemble methods, addressing class imbalance, and exploring deep learning models for real-time deployment.
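The pipeline named in the abstract (tokenization, TF-IDF vectorization, Logistic Regression) can be illustrated with a minimal sketch. This is not the thesis's implementation: it uses only the Python standard library, a handful of toy messages invented for illustration instead of the Kaggle data set, a common smoothed IDF formula chosen as an assumption, and it omits lemmatization. All function names here are hypothetical.

```python
import math
import re
from collections import Counter

# Toy messages invented for illustration; NOT the Kaggle data set.
# Label 1 = spam, 0 = ham.
TRAIN = [
    ("win a free prize now", 1),
    ("free cash offer claim now", 1),
    ("urgent claim your free prize", 1),
    ("are we meeting for lunch today", 0),
    ("see you at the office tomorrow", 0),
    ("can you send the report today", 0),
]

def tokenize(text):
    """Lowercase and split into alphabetic tokens (no lemmatization here)."""
    return re.findall(r"[a-z]+", text.lower())

def fit_idf(docs):
    """Smoothed inverse document frequency per term (assumed variant)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

def tfidf(doc, idf):
    """L2-normalised TF-IDF vector as a sparse dict; unseen terms are dropped."""
    counts = Counter(t for t in doc if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {t: v / norm for t, v in vec.items()}

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(vectors, labels, epochs=500, lr=0.5):
    """Logistic regression fitted by plain stochastic gradient descent."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for vec, y in zip(vectors, labels):
            p = sigmoid(bias + sum(weights.get(t, 0.0) * v for t, v in vec.items()))
            err = p - y  # gradient of the log loss with respect to the logit
            bias -= lr * err
            for t, v in vec.items():
                weights[t] = weights.get(t, 0.0) - lr * err * v
    return weights, bias

def predict_proba(text, idf, weights, bias):
    """Estimated probability that a message is spam."""
    vec = tfidf(tokenize(text), idf)
    return sigmoid(bias + sum(weights.get(t, 0.0) * v for t, v in vec.items()))

docs = [tokenize(text) for text, _ in TRAIN]
labels = [label for _, label in TRAIN]
idf = fit_idf(docs)
weights, bias = train_logreg([tfidf(d, idf) for d in docs], labels)
```

Calling `predict_proba("claim your free prize now", idf, weights, bias)` returns the model's estimated spam probability for a new message. On a real data set, a library implementation with a held-out test split and metrics such as precision and ROC-AUC would be the more usual choice.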

Publication Date

12-2025

Document Type

Thesis

Student Type

Graduate

Degree Name

Professional Studies (MS)

Department, Program, or Center

Graduate Programs & Research

Advisor

Sanjay Modak

Advisor/Committee Member

Ioannis Karamitsos

Campus

RIT Dubai
