Abstract
A critical challenge for lending institutions and peer-to-peer investors is identifying high-risk active borrowers in order to avoid defaults and the corresponding losses. This thesis develops a data-driven approach to predicting loan defaults that balances predictive accuracy with interpretability and cost-sensitive evaluation. Using a large dataset of loans from LendingClub, we applied the CRISP-DM methodology, performing extensive data preprocessing and exploratory analysis before training several machine learning models (logistic regression, decision tree, random forest, and XGBoost) to classify loans as default or non-default. Class imbalance was addressed through resampling, and models were tuned via cross-validation with a focus on maximizing the area under the ROC curve (ROC-AUC) and recall (sensitivity) to prioritize catching default cases. The results show that ensemble tree-based models significantly outperformed the baseline logistic model on predictive performance, yielding test ROC-AUC values of about 0.96–0.97. The best model, an ensemble, captured approximately 85% of the loans that defaulted in the test set, a substantial improvement over the traditional approach. SHAP was used to interpret the model and keep the modeling transparent. The most influential factors were intuitive: features describing loan repayment progress (for example, the proportion of principal repaid), loan grade and interest rate (which are indicative of the borrower’s credit quality), and loan amount, among others. Loans with very little principal repaid or higher risk grades were strongly associated with default outcomes, as would be expected from domain knowledge. By emphasizing recall in model evaluation, the framework directly addresses the asymmetric cost of misclassification in lending: missing a default (false negative) is far more costly than a false alarm (false positive).
Together, the final model and its explanations form a practical decision-support tool for lenders, enabling better-informed decisions about portfolio risk management and early warning for existing loans. Overall, this thesis develops an interpretable and financially informed machine learning solution for loan default prediction that can help reduce default rates and support risk management in lending portfolios.
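The asymmetric-cost argument for prioritizing recall can be illustrated with a minimal Python sketch. It is not taken from the thesis: the labels are toy data, and the cost values (`cost_fn`, `cost_fp`) and function names are hypothetical, chosen only to show how two classifiers with the same accuracy can differ sharply in recall and in misclassification cost.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn), where 1 = default and 0 = non-default."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        elif actual == 1 and predicted == 0:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def recall(y_true, y_pred):
    """Share of actual defaults that were flagged (sensitivity)."""
    tp, _, fn, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


def accuracy(y_true, y_pred):
    """Share of all loans classified correctly."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def misclassification_cost(y_true, y_pred, cost_fn=5.0, cost_fp=1.0):
    """Total cost under hypothetical per-error costs (assumed values):
    a missed default (false negative) costs more than a false alarm."""
    _, fp, fn, _ = confusion_counts(y_true, y_pred)
    return fn * cost_fn + fp * cost_fp


# Toy data (illustrative only): 3 defaults among 8 loans.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
cautious = [1, 1, 1, 1, 1, 0, 0, 0]  # flags every default, 2 false alarms
lenient = [1, 0, 0, 0, 0, 0, 0, 0]   # no false alarms, misses 2 defaults

for name, y_pred in [("cautious", cautious), ("lenient", lenient)]:
    print(
        name,
        "accuracy:", accuracy(y_true, y_pred),
        "recall:", round(recall(y_true, y_pred), 3),
        "cost:", misclassification_cost(y_true, y_pred),
    )
```

Both toy classifiers make two errors out of eight predictions, so accuracy alone cannot separate them. Only recall and the cost-weighted total show that the one missing defaults is the worse choice under the assumed costs.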
Library of Congress Subject Headings
Default (Finance)--Forecasting--Data processing; Bank loans--Data processing; Portfolio management--Data processing
Publication Date
12-2025
Document Type
Thesis
Student Type
Graduate
Degree Name
Professional Studies (MS)
Department, Program, or Center
Graduate Programs & Research
Advisor
Sanjay Modak
Advisor/Committee Member
Ayman Ibrahim
Recommended Citation
Almarri, Humaid Sultan, "Predicting Loan Defaults: A Behavioral Scoring Approach for Portfolio Risk Monitoring in Banking and P2P Lending" (2025). Thesis. Rochester Institute of Technology. Accessed from
https://repository.rit.edu/theses/12439
Campus
RIT Dubai
Plan Codes
PROFST-MS
