Abstract

Diabetes, a medical condition recognized since ancient times, has become a prevalent and significant health concern in recent decades, and its rising incidence makes early diagnosis and effective treatment increasingly important. Advances in machine learning (ML) have transformed disease prediction and decision-making by making use of large datasets. This study develops and compares ML models for diabetes prediction using a preprocessed dataset of 532 instances obtained from Kaggle. The dataset contains the variables Pregnancies, Glucose, Blood Pressure, Skin Thickness, Insulin, BMI, Diabetes Pedigree Function, Age, and Outcome. Correlation analysis revealed a strong positive association between Glucose and Outcome, suggesting that elevated glucose levels are associated with an increased risk of diabetes. Age was also positively correlated with Outcome, suggesting that age may be a risk factor. Six ML models, namely Voting, Extra Trees, Bagging, Gradient Boosting, Logistic Regression (LR), and Random Forest Regression (RFR), were trained and optimized using Randomized Search CV for hyperparameter tuning. Evaluating the models on Sensitivity, Specificity, Precision, Negative Predictive Value, and Accuracy revealed their respective strengths and weaknesses. The Extra Trees model achieved the highest predictive accuracy for both diabetic and non-diabetic cases. Feature importance analysis using SHAP summary plots showed that "Glucose" and "Age" were the most influential features for diabetes prediction, highlighting their diagnostic value. The study concludes with a thorough comparison of ML models for diabetes prediction, and the findings demonstrate the potential of ML techniques for early disease detection and decision-making. The analyses of the optimized models' performance and feature importance contribute insights to the field of diabetes management. Future research may enlarge the dataset and investigate additional ML algorithms, potentially improving prediction accuracy and facilitating personalized patient care.
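
The five evaluation metrics named in the abstract can all be derived from the four cells of a binary confusion matrix. The following is a minimal sketch of those definitions, not the project's own code. The function name `confusion_metrics` and the example counts are hypothetical and chosen only for illustration; they are not results from the study.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Compute the five metrics from binary confusion-matrix counts.

    tp/fp/tn/fn are counts of true positives, false positives,
    true negatives, and false negatives (positive = diabetic).
    """
    def ratio(num, den):
        # Return NaN instead of raising when a denominator is zero.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # diabetic cases correctly flagged
        "specificity": ratio(tn, tn + fp),  # non-diabetic cases correctly cleared
        "precision": ratio(tp, tp + fp),    # positive predictions that were right
        "npv": ratio(tn, tn + fn),          # negative predictions that were right
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
    }


# Hypothetical counts for illustration only (not taken from the study).
example = confusion_metrics(tp=40, fp=10, tn=45, fn=5)
```

Reporting sensitivity and specificity alongside accuracy shows how a model performs on diabetic and non-diabetic cases separately, which accuracy alone can hide when the classes are imbalanced.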

Publication Date

9-2023

Document Type

Master's Project

Student Type

Graduate

Degree Name

Professional Studies (MS)

Department, Program, or Center

Graduate Programs & Research (Dubai)

Advisor

Khalil Alhussaeni

Advisor/Committee Member

Sanjay Modak

Campus

RIT Dubai
