Abstract

This dissertation investigates the application of machine learning to early-stage diabetes prediction by comparing the performance of three classification models: Logistic Regression, Decision Tree, and Random Forest. Motivated by the rising prevalence of diabetes worldwide, the research aims to facilitate early diagnosis by providing interpretable and accurate predictive models. Using the Pima Indians Diabetes Dataset, which contains 768 clinical records, the study applied data preprocessing consisting of KNN imputation, outlier handling, feature scaling, and the creation of interaction features. A quantitative, comparative approach was used to train and test the models, which were evaluated on accuracy, precision, recall, F1 score, and ROC-AUC. Logistic Regression offered the best balance of accuracy and interpretability (AUC 0.88), making it a reasonable candidate for use in clinical practice. Random Forest achieved the highest predictive accuracy but was not transparent. The most influential features for prediction were glucose, BMI, and age. The research recommends Logistic Regression as the model to deploy in mobile and community-based screening devices. Future research should use larger and more heterogeneous datasets, apply interpretability techniques such as SHAP or LIME, and extend the framework to predict other chronic conditions, including hypertension.

Library of Congress Subject Headings

Diabetes--Forecasting--Data processing; Diabetes--Diagnosis; Machine learning

Publication Date

12-2025

Document Type

Thesis

Student Type

Graduate

Department, Program, or Center

Graduate Programs & Research

Advisor

Sanjay Modak

Advisor/Committee Member

Khalil Al Hussaeni

Campus

RIT Dubai

Plan Codes

PROFST-MS
