Applying data equality and justice considerations to a consumer default prediction model CAN enhance overall profitability

(Random Forest Analytics using Python Code) When developing a default prediction model for consumer loans, a data scientist is required, first and foremost, to build a model that accurately predicts future defaults. The “fairness” of the model can easily become subordinate to accuracy. But it does not have to be. As part of my class project for DATA 70500 at the City University of New York Graduate Center, I extended a highly accurate random forest model illustrated by an online article to create a more “balanced” model. What follows is a framework in which a “reduced form” model can still make default predictions that are as good as, or almost as good as, those of the original model, while improving the perceived fairness of the model. The result is still a highly accurate model, with improved possibilities for developing new consumer credit businesses.

THE ORIGINAL MODEL A well-written and easy-to-read introductory article in the online magazine Medium, “Predicting Loan Default Risk: A Hands-On Guide with Python,” illustrated the solid performance of a random forest model in predicting default from applicant profile data. Some elements of potentially discriminatory profile data, however, such as occupation, marital status, and location of residence, were included as independent variables.

Predicting Loan Default Risk: A Hands-On Guide with Python | by Eulene | Medium

“REDUCED FORM” MODEL INCORPORATING LESS “DISCRIMINATORY DATA” AND ADDITIONAL PERFORMANCE EVALUATION MEASURES The article provides step-by-step Python code for developing a random forest model that predicts the likelihood of default among a given set of borrowers in India. The author states that the primary purpose of the model is to “help the lenders make informed decisions and minimize potential losses” by “analyzing various factors like income, credit history, and economic trends.” The data is obtained from Kaggle Datasets, titled “Understanding Applicant Details for Loan Approval in India,” and is described as a “rich collection of variables capturing various aspects of loan applicants, including personal, financial, and demographic information. It includes features such as age, gender, marital status, employment details, income, loan amount, loan term, credit history, and more. Additionally, it provides the target variable indicating whether the loan was approved or not, making it suitable for classification tasks.”

Applicant Details For Loan Approve
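For readers who want to follow along, the sketch below shows the general shape of such a pipeline: load the applicant data, encode the categorical columns, and fit a scikit-learn random forest. The file name, the target column ("loan_status"), and the exact column labels are assumptions on my part; the article's notebook and the Kaggle file may name them differently.

```python
# Minimal sketch of the original pipeline, assuming the Kaggle CSV has been
# downloaded locally and the target column is named "loan_status"
# (actual column names in the dataset may differ).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv("loan_applicant_details.csv")   # hypothetical file name

# Encode categorical columns (gender, marital status, occupation, etc.) as integers
for col in df.select_dtypes(include="object").columns:
    df[col] = LabelEncoder().fit_transform(df[col])

X = df.drop(columns=["loan_status"])             # all applicant features
y = df["loan_status"]                            # 1 = defaulted, 0 = performing (assumed coding)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print("Test accuracy:", rf.score(X_test, y_test))
```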

My question at this stage is: how important are the “discriminatory” factors in predicting the likelihood of default? I tested this by removing Column 8 (Occupation) and Column 5 (Marital Status) from the original dataset and running the same model, with the set of independent variables reduced by two. The rationale for eliminating marital status and occupation goes back to my time as a young bank trainee about 45 years ago, when my instructors had biases against mortgage borrowers based on their marital status and the industries they worked in. I believe these overt biases are no longer accepted, but the Kaggle data on consumer loans suggests they have not completely disappeared.
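The reduction step itself is simple, as the sketch below shows (continuing from the pipeline above). The column names "Marital_Status" and "Occupation" are placeholders for whatever the two columns are actually called in the dataset.

```python
# "Reduced form" model: drop the two potentially discriminatory columns and
# refit the same random forest on the remaining features.
X_reduced = X.drop(columns=["Marital_Status", "Occupation"], errors="ignore")

Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    X_reduced, y, test_size=0.2, random_state=42, stratify=y
)

rf_reduced = RandomForestClassifier(n_estimators=100, random_state=42)
rf_reduced.fit(Xr_train, yr_train)
print("Reduced-model test accuracy:", rf_reduced.score(Xr_test, yr_test))
```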

The revised model outcomes, when compared with the original model outputs, showed that:

  1. The models had essentially the same levels of accuracy in predicting defaults.
  2. The new model had a marginally higher rate of predicting customers as “performing” who actually “defaulted.”
  3. The new model was marginally better at extending loans to somewhat higher-risk customers who nonetheless “performed.” (Please refer to the table below.)
| Model | AUC (Area Under the Curve) | Correctly Predicting Loan Default | Correctly Predicting Loan Performing | Predicting Default but Actually Performing | Predicting Performing but Actually Defaulted |
|---|---|---|---|---|---|
| Original Model | 97.66% | 9.09% | 84.66% | 2.55% | 3.69% |
| Model with Marital Status and Job Description taken out | 97.66% | 8.76% | 85.06% | 2.18% | 4.01% |

(The last four columns are the cells of the confusion matrix, expressed as shares of the test set.)
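Figures of this kind can be computed from either fitted model with scikit-learn's metrics. The sketch below, written against the hypothetical reduced model above, expresses the four confusion-matrix cells as shares of the test set, which is how the percentages in the table are presented; the exact numbers depend on the particular train/test split used in the notebook.

```python
from sklearn.metrics import roc_auc_score, confusion_matrix

# Predicted default probabilities and hard predictions for the reduced model
proba = rf_reduced.predict_proba(Xr_test)[:, 1]
pred = rf_reduced.predict(Xr_test)

auc = roc_auc_score(yr_test, proba)
tn, fp, fn, tp = confusion_matrix(yr_test, pred).ravel()
total = tn + fp + fn + tp

print(f"AUC: {auc:.2%}")
print(f"Correctly predicting default:                {tp / total:.2%}")
print(f"Correctly predicting performing:             {tn / total:.2%}")
print(f"Predicted default but actually performing:   {fp / total:.2%}")
print(f"Predicted performing but actually defaulted: {fn / total:.2%}")
```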

The code for this updated model can be found at the link below.

https://colab.research.google.com/drive/1WVgu7F4U6HW23hNCCz6vq3vE0JO-VOuA?usp=sharing

So, in conclusion, we could strive to reduce discrimination while still achieving statistical excellence and strong business returns.

This experiment demonstrates a clear potential for removing potentially discriminatory data from default prediction models without significantly affecting their precision and reliability. It suggests that banks and other lending organizations can review their consumer loan default prediction models in order to act more socially responsibly. The negative impact of removing such independent variables may prove to be quite limited, as the example above shows.
