Diabetes, a medical condition known since ancient times, has become a prevalent and significant health concern in recent decades. Its rising incidence makes early diagnosis and effective treatment increasingly important. Advances in machine learning (ML) have transformed disease prediction and decision-making by drawing on large datasets. This study develops and compares ML models for diabetes prediction using a preprocessed dataset of 532 instances obtained from Kaggle. The dataset contains the variables Pregnancies, Glucose, Blood Pressure, Skin Thickness, Insulin, BMI, Diabetes Pedigree Function, Age, and Outcome. Correlation analysis revealed a strong positive association between Glucose and Outcome, suggesting that elevated glucose levels are associated with an increased risk of diabetes. Age was also positively correlated with Outcome, suggesting that age may be a risk factor. Six ML models, namely Voting, Extra Trees, Bagging, Gradient Boosting, Logistic Regression (LR), and Random Forest Regression (RFR), were trained, and their hyperparameters were tuned using Randomized Search CV. Evaluating the models on Sensitivity, Specificity, Precision, Negative Predictive Value, and Accuracy revealed their respective strengths and weaknesses. The Extra Trees model achieved the highest predictive accuracy for both diabetic and non-diabetic cases. Feature importance analysis using SHAP summary plots showed that "Glucose" and "Age" were the most influential features for diabetes prediction, which highlights their diagnostic value. The study concludes with a thorough comparison of ML models for diabetes prediction. The findings demonstrate the potential of machine learning techniques for early disease detection and decision-making. The analyses of the optimized models' performance and of feature importance contribute valuable insights to the field of diabetes management. Future research may enlarge the dataset and investigate additional, more powerful ML algorithms, potentially improving prediction accuracy and facilitating personalized patient care.
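
The five evaluation metrics named in the abstract can all be derived from a binary confusion matrix. The following is a minimal sketch of that calculation, not the project's own code: the function name `confusion_metrics` and the toy labels are hypothetical and are not taken from the study's data or results. It assumes Outcome is coded 1 for diabetic and 0 for non-diabetic.

```python
from typing import Dict, Sequence


def confusion_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute binary classification metrics, treating 1 (diabetic) as positive."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Count the four cells of the confusion matrix.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def ratio(num: int, den: int) -> float:
        # Return NaN instead of raising when a denominator is zero.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),   # true positive rate (recall)
        "specificity": ratio(tn, tn + fp),   # true negative rate
        "precision": ratio(tp, tp + fp),     # positive predictive value
        "npv": ratio(tn, tn + fn),           # negative predictive value
        "accuracy": ratio(tp + tn, len(y_true)),
    }


# Hypothetical toy labels for illustration only (not study data).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
metrics = confusion_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Reporting sensitivity and specificity alongside accuracy matters for a dataset like this one, because a model can score well on accuracy while still missing many diabetic cases if the two classes are imbalanced.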

Publication Date


Document Type

Master's Project

Student Type


Degree Name

Professional Studies (MS)

Department, Program, or Center

Graduate Programs & Research (Dubai)


Khalil Alhussaeni

Advisor/Committee Member

Sanjay Modak


RIT Dubai