Initially I had almost 3 million data points and 140 features. My goal was to shrink the dataset to fewer than 50 features before moving on to modeling.
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2925493 entries, 0 to 2925492
Columns: 141 entries, id to debt_settlement_flag
dtypes: float64(106), object(35)
memory usage: 3.1+ GB
```
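For reference, a minimal sketch of how the raw file can be loaded and inspected; the file name here is a placeholder, not necessarily the exact Lending Club export I used:

```python
import pandas as pd

# Placeholder file name for the raw Lending Club export
df = pd.read_csv("lending_club_loans.csv", low_memory=False)

# Prints the row count, column count, dtypes, and memory usage shown above
df.info()
```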
After data cleaning, EDA, and feature engineering, I was able to cut it down to about 246,000 data points with 28 features. Here’s what I did:
- Dropped columns with more than 30% missing values -> cut down to 104 features (see the cleaning sketch after this list)
- Checked the Loan Dictionary for each column's definition and decided whether to drop it -> cut down to 51 features
- Deleted rows with missing values -> n: 2,925,493 to 2,657,654
- Filtered the target, Loan Status, to keep only "Fully Paid" and "Charged Off" and discarded the other statuses -> n: 2,657,654 to 1,711,571
- Made a histogram of loan issue dates and found that most records fall between 2017 and 2020 (the original issue-date range was 2007 to 2020). I assume Lending Club has grown more popular over time, and I preferred not to include loans from around the 2008 financial crisis since conditions were very different, so I kept only roughly the last three years of data, January 2018 to September 2020 -> n: 246,004
- Dropped the financial-institution features (interest rate, grade, sub grade, FICO score), since these are likely produced by financial professionals and I wanted to see how my model performs without them first. I may add them back later to see whether they improve the model, since they are strong predictors.
- Examined the relationship between the feature "employment length" and the target "Loan Status". Borrowers with 10+ years of employment are the most likely to pay back their loans; on the other hand, they are also the largest group among charge-offs.
- Created dummies for categorical features and applied feature engineering to "address state", grouping the states into 4 major US regions instead of keeping one column per state (see the encoding sketch after this list).
- Made a pairplot to observe trends across the numeric features.
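A condensed sketch of the cleaning and filtering steps above, assuming the frame `df` from the earlier snippet and the standard Lending Club column names (`loan_status`, `issue_d`, `int_rate`, `grade`, `sub_grade`, `fico_range_low`, `fico_range_high`); the exact names and order may differ from my notebook:

```python
# Drop columns with more than 30% missing values
missing_share = df.isna().mean()
df = df.loc[:, missing_share <= 0.30]

# (Columns flagged during the Loan Dictionary review were dropped by hand; omitted here)

# Drop rows that still contain missing values
df = df.dropna()

# Keep only the two target statuses
df = df[df["loan_status"].isin(["Fully Paid", "Charged Off"])]

# Keep loans issued between Jan 2018 and Sep 2020
df["issue_d"] = pd.to_datetime(df["issue_d"], format="%b-%Y")
df = df[(df["issue_d"] >= "2018-01-01") & (df["issue_d"] <= "2020-09-30")]
df = df.drop(columns=["issue_d"])  # the issue date was only needed for filtering

# Drop the financial-institution features for the first modeling pass
df = df.drop(columns=["int_rate", "grade", "sub_grade",
                      "fico_range_low", "fico_range_high"],
             errors="ignore")
```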
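And a sketch of the target encoding, the state-to-region grouping, the dummies, and the pairplot. The region mapping below is an abbreviated illustration (a Census-style four-region split), and the column names `addr_state`, `loan_amnt`, `annual_inc`, and `dti` are assumed from the standard Lending Club schema:

```python
import seaborn as sns

# Binary target: 1 = Charged Off, 0 = Fully Paid
df["target"] = (df["loan_status"] == "Charged Off").astype(int)
df = df.drop(columns=["loan_status"])

# Map each state to one of 4 major US regions (truncated here; the full map covers all states)
region_map = {
    "CA": "West", "WA": "West", "OR": "West", "NV": "West", "AZ": "West",
    "TX": "South", "FL": "South", "GA": "South", "NC": "South", "VA": "South",
    "NY": "Northeast", "NJ": "Northeast", "PA": "Northeast", "MA": "Northeast",
    "IL": "Midwest", "OH": "Midwest", "MI": "Midwest", "MN": "Midwest",
    # ... remaining states assigned similarly
}
df["region"] = df["addr_state"].map(region_map)
df = df.drop(columns=["addr_state"])

# One-hot encode the remaining categorical features
df = pd.get_dummies(df, drop_first=True)

# Pairplot on a few numeric features to eyeball trends (sampled to keep it fast)
numeric_cols = ["loan_amnt", "annual_inc", "dti"]
sns.pairplot(df.sample(5000, random_state=42), vars=numeric_cols)
```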
I ran a simple Logistic Regression as my baseline model. In this project I assume banks don't want to lend money to borrowers who can't pay it back, so I focus more on Recall than Precision. My baseline model's Recall is only 0.02, which I'll improve in the next model; its F1 score is 0.042 and the test AUC is 0.626. See the following Classification Report and Confusion Matrix:
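The report and matrix come from a setup like this; a minimal sketch assuming the engineered frame from the snippets above with its binary `target` column, and illustrative split and solver settings:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

X = df.drop(columns=["target"])
y = df["target"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Scale features so the solver converges despite raw dollar-amount columns
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

baseline = LogisticRegression(max_iter=1000)
baseline.fit(X_train_s, y_train)

pred = baseline.predict(X_test_s)
print(classification_report(y_test, pred))
print(confusion_matrix(y_test, pred))
print("Test AUC:", roc_auc_score(y_test, baseline.predict_proba(X_test_s)[:, 1]))
```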
Since I make "Charged Off" the positive class (1) and "Fully Paid" the negative class (0), the classes are imbalanced (roughly 2:8), so I added class weights to balance them instead of oversampling. The scores improved: Recall increased to 0.53 and the F1 score to 0.42, while the test AUC stayed at 0.626, which makes sense because class weighting mostly shifts the decision boundary rather than the ranking of predictions. See the following Classification Report and Confusion Matrix:
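In this setup the rebalancing is a one-line change: scikit-learn's `class_weight="balanced"` reweights the minority class during fitting. A sketch, reusing the split and scaler from the baseline snippet:

```python
# Same model as the baseline, but with balanced class weights
weighted = LogisticRegression(max_iter=1000, class_weight="balanced")
weighted.fit(X_train_s, y_train)

pred_w = weighted.predict(X_test_s)
print(classification_report(y_test, pred_w))
print(confusion_matrix(y_test, pred_w))
print("Test AUC:", roc_auc_score(y_test, weighted.predict_proba(X_test_s)[:, 1]))
```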
Next, I plan to try Random Forest and XGBoost for further modeling.