Abstract

A critical challenge for lending institutions and peer-to-peer investors is identifying high-risk active borrowers in order to avoid defaults and the resulting losses. This thesis develops a data-driven approach to predicting loan defaults that balances predictive accuracy with interpretability and cost-sensitive evaluation. Using a large dataset of loans from LendingClub, we applied the CRISP-DM methodology, performing extensive data preprocessing and exploratory analysis before training several machine learning models (logistic regression, decision tree, random forest, and XGBoost) to classify loans as default or non-default. Class imbalance was addressed through resampling, and models were tuned via cross-validation with a focus on maximizing the area under the ROC curve (ROC-AUC) and recall (sensitivity) to prioritize catching default cases. The results show that ensemble tree-based models significantly outperformed the baseline logistic regression model, yielding test ROC-AUC values of about 0.96–0.97. The best model, an ensemble, captured approximately 85% of the loans that defaulted in the test set, a substantial improvement over the traditional approach. To keep the model interpretable and transparent, SHAP (SHapley Additive exPlanations) was used to explain its predictions. The most influential factors were intuitive and included features describing loan repayment progress (for example, the proportion of principal repaid), loan grade and interest rate (both indicative of the borrower’s credit quality), and loan amount, among others. Loans with very little principal repaid or higher risk grades were strongly associated with default outcomes, as would be expected from domain knowledge. By emphasizing recall in model evaluation, the framework directly addresses the asymmetric cost of misclassification in lending: missing a default (false negative) is far more costly than a false alarm (false positive).
The final model and its explanations together form a practical decision-support tool for lenders, enabling better-informed decisions on portfolio risk management and providing early warning for existing loans. Overall, this thesis develops an interpretable and financially informed machine learning solution for loan default prediction that can help reduce default rates and support risk management in lending portfolios.
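
The asymmetric-cost argument above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the confusion-matrix counts are invented (they are not results from the thesis), the 10:1 cost ratio is a placeholder, and the function names are hypothetical. It only shows why a model with slightly lower accuracy but higher recall can be the cheaper choice when false negatives are expensive.

```python
# Illustrative only: counts and costs below are made up, not thesis results.

def recall(tp: int, fn: int) -> float:
    """Share of actual defaults that the model flags (sensitivity)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Share of all loans classified correctly."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def misclassification_cost(fp: int, fn: int, cost_fp: float, cost_fn: float) -> float:
    """Total cost when each error type carries its own unit cost."""
    return fp * cost_fp + fn * cost_fn


# Assumed unit costs: a missed default is taken to be 10x a false alarm.
COST_FP = 1.0
COST_FN = 10.0

# Two hypothetical models scored on the same 1,000 loans (100 true defaults).
model_a = {"tp": 85, "fn": 15, "fp": 40, "tn": 860}  # recall-oriented
model_b = {"tp": 60, "fn": 40, "fp": 10, "tn": 890}  # accuracy-oriented

for name, m in (("A", model_a), ("B", model_b)):
    print(
        name,
        f"accuracy={accuracy(m['tp'], m['tn'], m['fp'], m['fn']):.3f}",
        f"recall={recall(m['tp'], m['fn']):.2f}",
        f"cost={misclassification_cost(m['fp'], m['fn'], COST_FP, COST_FN):.0f}",
    )
```

Under these assumed numbers, model B is marginally more accurate, yet model A incurs less than half the total misclassification cost because it misses far fewer defaults. The actual cost ratio would have to come from the lender's own loss data.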

Library of Congress Subject Headings

Default (Finance)--Forecasting--Data processing; Bank loans--Data processing; Portfolio management--Data processing

Publication Date

12-2025

Document Type

Thesis

Student Type

Graduate

Degree Name

Professional Studies (MS)

Department, Program, or Center

Graduate Programs & Research

Advisor

Sanjay Modak

Advisor/Committee Member

Ayman Ibrahim

Campus

RIT Dubai

Plan Codes

PROFST-MS
