Abstract

Credit risk assessment remains a critical function of financial institutions, particularly within the rapidly evolving digital lending environment. This research explores the effectiveness of five machine learning models (Logistic Regression, Decision Tree, Random Forest, Gradient Boosting, and Support Vector Machine) in predicting loan default using both financial indicators and categorical borrower attributes. The study is motivated by the growing availability of structured borrower data and the need to evaluate whether advanced machine learning algorithms can be adopted in place of conventional credit scoring methods. Using the publicly available L&T Vehicle Loan Default Prediction dataset, comprising 233,154 borrower records and 41 structured attributes, the study followed the CRISP-DM framework, covering data preparation, feature engineering, class imbalance handling using SMOTE, model training, and evaluation. Model performance was assessed using Accuracy, Precision, Recall, F1-score, ROC–AUC, and confusion matrix analysis to ensure robust evaluation under imbalanced class conditions. The findings indicate that financial leverage and credit behavior variables, namely loan-to-value ratio (LTV), disbursed amount, credit bureau score (PERFORM_CNS.SCORE), and prior sanctioned amounts, are the most influential predictors of loan default. Among the evaluated models, Gradient Boosting achieved the highest ROC–AUC (0.641), followed by Random Forest (0.637). However, Gradient Boosting showed weak minority-class detection. Random Forest delivered the most balanced classification performance, producing the fewest false positives when detecting defaulters. Logistic Regression and Decision Tree achieved stronger recall for defaulters but at the cost of higher false-positive rates, while Support Vector Machine was the weakest performer, even after SMOTE was applied to address the class imbalance. Categorical borrower attributes were no more important than core financial indicators for detecting defaulters. To enhance transparency, SHAP and LIME were applied for global and local interpretability. Both techniques consistently confirmed the dominance of leverage and credit history variables and indicated that geography also plays a role in detecting defaulters.

Publication Date

5-2026

Document Type

Thesis

Student Type

Graduate

Degree Name

Professional Studies (MS)

Department, Program, or Center

Graduate Programs & Research

Advisor

Khalil Al Hussaeni

Campus

RIT Dubai
