Abstract

Student dropout remains a persistent challenge in higher education, undermining institutional performance, reducing workforce preparedness, and limiting students’ academic and economic opportunities. Accurately identifying students at risk of attrition is difficult because academic, financial, and behavioral factors interact. This thesis addresses this challenge by applying a combined machine learning framework, integrating both unsupervised and supervised techniques, to predict student dropout from structured first-year academic and financial data. The study uses a dataset of 4,424 undergraduate student records from a European higher education institution, covering ten academic years and comprising 35 variables related to academic performance, enrollment behavior, and financial engagement. Clustering techniques were employed to group students by engagement profiles, while classification models (Random Forest, XGBoost, and a soft voting ensemble) were trained to predict final academic outcomes: Dropout, Enrolled, or Graduate. Feature engineering was conducted in two phases, with both semester-averaged metrics and advanced derived indicators used to enhance model performance. Findings show that academic approvals, grades, and tuition fee status are the most influential predictors of student outcomes. Unsupervised clustering revealed behaviorally distinct groups with statistically significant differences in dropout risk, though these clusters did not translate effectively into predictive labels. Supervised models, particularly tuned XGBoost and ensemble classifiers, achieved high performance in binary classification tasks (balanced accuracy > 0.91, AUC > 0.95), confirming that dropout risk can be reliably predicted from early academic records. However, multiclass classification performance declined, especially for the transitional “Enrolled” category, highlighting the limitations of static early-year data in capturing more ambiguous student states.
This research contributes to the literature by demonstrating the strengths and constraints of interpretable machine learning in modeling student success. It also offers actionable insights for academic institutions, such as prioritizing interventions for students with early signs of disengagement and financial instability. Methodologically, the study highlights opportunities for future work to explore hybrid clustering-classification models, apply soft clustering techniques, and evaluate deep learning models for benchmarking purposes. While complex models may lack interpretability, they can serve as useful baselines to understand performance ceilings within structured educational datasets.

Library of Congress Subject Headings

College dropouts--Forecasting--Data processing; College dropouts--Prevention; Machine learning; Predictive analytics

Publication Date

5-2025

Document Type

Thesis

Student Type

Graduate

Degree Name

Professional Studies (MS)

Department, Program, or Center

Graduate Programs & Research

Advisor

Sanjay Modak

Advisor/Committee Member

Ioannis Karamitsos

Campus

RIT Dubai

Plan Codes

PROFST-MS
