The purpose of the model I plan to build is to predict whether an individual borrower will repay their loan, given data points from a historical loan database.
In the future, when a new potential borrower comes to us, we can predict whether this person will pay back the loan. This will support the decision-making framework for whether to lend to the borrower.
My client, a lending institution, provides loans to borrowers and charges interest. The borrower is obligated to pay interest periodically and to pay back the full principal of the loan at the end. If the borrower repays the loan, the lender receives the full interest and the capital it initially loaned. If the borrower defaults at any point in time, the lender loses the entire principal of the loan as well as the interest payments not yet received. This would have a large impact on the lender's profit and capital base.
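To make the payoff asymmetry concrete, here is a minimal sketch of the lender's net cash flow under the simplified loan structure described above (periodic interest, principal due at the end). The function name `lender_net_cash`, the no-recovery-after-default assumption, and the loan figures are illustrative only and are not taken from the dataset.

```python
def lender_net_cash(principal, rate_per_period, n_periods, default_after=None):
    """Net cash to the lender for an interest-only loan with principal due at the end.

    default_after: number of interest payments received before the borrower
    defaults (None means the loan is repaid in full).
    Assumption: nothing is recovered after a default.
    """
    interest_payment = principal * rate_per_period
    if default_after is None:
        # All interest is received and the principal comes back: profit = interest.
        return interest_payment * n_periods
    # Interest received so far, minus the lost principal.
    return interest_payment * default_after - principal


# Hypothetical loan: 10,000 at 1% per month for 36 months.
repaid = lender_net_cash(10_000, 0.01, 36)
defaulted = lender_net_cash(10_000, 0.01, 36, default_after=12)
print(repaid, defaulted)
```

With these made-up numbers, a single default after one year costs more than the profit from two fully repaid loans of the same size, which is why reducing defaults matters so much to the lender.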
With this prediction model, my client will be able to assess, for each borrower, the risk that the loan and interest will not be paid back.
By using the model to lower the risk of borrower defaults, the lender should see a lower percentage of its borrowers default, which helps the business earn higher profits and maintain more customer relationships.
This dataset is provided by Lending Club and was downloaded from Kaggle.
It's a large dataset (2,925,493 data points with 140+ features), so I plan to cut down the size and use only the most recent 3 years of data for my analysis.
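A minimal sketch of the three-year cut with pandas is shown here. It assumes the issue date sits in a column named `issue_d` with values like "Dec-2018"; both the column name and the date format are assumptions to check against the actual download, and `keep_recent_years` is a hypothetical helper.

```python
import pandas as pd


def keep_recent_years(df, date_col="issue_d", years=3):
    """Return only the rows issued within the last `years` years of the data."""
    # Assumption: dates are strings such as "Dec-2018" (month abbreviation, year).
    issued = pd.to_datetime(df[date_col], format="%b-%Y")
    cutoff = issued.max() - pd.DateOffset(years=years)
    return df.loc[issued > cutoff].copy()


# Tiny made-up frame standing in for the full download.
sample = pd.DataFrame({
    "issue_d": ["Jan-2015", "Mar-2016", "Jun-2017", "Dec-2018"],
    "loan_amnt": [5000, 12000, 8000, 20000],
})
recent = keep_recent_years(sample)
print(recent)
```

The cutoff is measured back from the latest issue date found in the data rather than from today's date, so the result does not change depending on when the notebook is run.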
Features include:
Variables:
loan amount, employment title, employment length, home ownership, annual income, debt-to-income ratio, loan purpose, FICO score, loan grade from Lending Club, interest rate, installment, address state, etc.
Target:
Loan Status, whether Charged Off or Fully Paid, will be the target I aim to predict.
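One way to turn Loan Status into a binary target could look like the sketch below. The column name `loan_status`, the helper `make_target`, and the choice to encode Charged Off as 1 are assumptions made for illustration; "Current" stands in for any status whose final outcome is not yet known.

```python
import pandas as pd


def make_target(df, status_col="loan_status"):
    """Keep finished loans and add a binary target: 1 = Charged Off, 0 = Fully Paid."""
    mapping = {"Fully Paid": 0, "Charged Off": 1}
    # Loans with any other status have no known outcome yet, so they are dropped.
    finished = df[df[status_col].isin(list(mapping))].copy()
    finished["target"] = finished[status_col].map(mapping)
    return finished


# Made-up rows standing in for the real data.
sample = pd.DataFrame(
    {"loan_status": ["Fully Paid", "Current", "Charged Off", "Fully Paid"]}
)
labeled = make_target(sample)
print(labeled["target"].value_counts())
```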
A table with detailed descriptions of all features/columns in the dataset is available. See below for some examples:
Modeling:
- from sklearn.model_selection import train_test_split
- from sklearn.preprocessing import StandardScaler
- from sklearn.linear_model import LogisticRegression
- from sklearn.tree import DecisionTreeClassifier
- from sklearn.ensemble import RandomForestClassifier
- from sklearn.neighbors import KNeighborsClassifier
- from sklearn.metrics import roc_auc_score
Evaluation:
- from sklearn.metrics import classification_report, confusion_matrix
- import numpy as np
- import pandas as pd
Pairplot:
- import matplotlib.pyplot as plt
- %matplotlib inline
Mapping:
- Tableau
✨A histogram to show the relationship between Employment Length and Loan Status (repay or default)✨
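A rough sketch of how this chart might be produced is given below. Because employment length is categorical, it uses a grouped bar chart of counts rather than a numeric histogram. The column names `emp_length` and `loan_status`, the helper `plot_emp_length_by_status`, and the sample rows are all assumptions made for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_emp_length_by_status(df, ax=None):
    """Bar chart of loan counts per employment length, split by loan status."""
    counts = pd.crosstab(df["emp_length"], df["loan_status"])
    if ax is None:
        _, ax = plt.subplots(figsize=(8, 4))
    counts.plot(kind="bar", ax=ax)
    ax.set_xlabel("Employment length")
    ax.set_ylabel("Number of loans")
    ax.set_title("Loan status by employment length")
    return counts, ax


# Made-up rows standing in for the real data.
sample = pd.DataFrame({
    "emp_length": ["< 1 year", "< 1 year", "5 years",
                   "10+ years", "10+ years", "10+ years"],
    "loan_status": ["Charged Off", "Fully Paid", "Fully Paid",
                    "Fully Paid", "Charged Off", "Fully Paid"],
})
counts, ax = plot_emp_length_by_status(sample)
print(counts)
```

The bars appear in alphabetical order of the labels, so on the real data the categories would likely need to be reordered by actual length before plotting.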
✨A baseline Logistic Regression model✨
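As a starting point, a baseline of this kind could be sketched as follows. The data here is synthetic (generated with `make_classification`) and only stands in for the prepared loan features and target; the class balance, split size, and `max_iter` value are illustrative assumptions rather than tuned choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the loan data (roughly 20% in class 1).
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.8, 0.2], random_state=42,
)

# Stratify so both splits keep the same class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Fit the scaler on the training split only, to avoid leaking test information.
scaler = StandardScaler().fit(X_train)
model = LogisticRegression(max_iter=1000)
model.fit(scaler.transform(X_train), y_train)

X_test_scaled = scaler.transform(X_test)
proba = model.predict_proba(X_test_scaled)[:, 1]
auc = roc_auc_score(y_test, proba)
print(f"ROC AUC: {auc:.3f}")
print(classification_report(y_test, model.predict(X_test_scaled)))
```

Since defaults are the minority class, ROC AUC and the per-class recall in the report are more informative here than plain accuracy.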