Using Logistic Regression to Predict Personal Loan Purchase: A Classification Approach

In a previous post, I explored building a supervised machine learning model using linear regression to predict the price of used cars. In this post, I will use supervised learning with classification to see if I can successfully build a model to predict whether a liability customer will buy a personal loan or not from a bank.

Before we dive in, I think it i important to distinguish between these two approaches in supervised learning. As a reminder, in linear regression, the algorithm learns to identify the linear relationship between input variables and output variables. The goal is to find the best-fitting line that describes the relationship between the input variables and the output variables. This line is determined by minimizing the sum of the squared differences between the predicted values and the actual values. During training, the algorithm is provided with a set of input variables and their corresponding output labels. The algorithm uses this data to learn the relationship between the input and output variables. Once the algorithm has learned this relationship, it can use it to make predictions on new, unseen data.

In classification, the algorithm learns to identify patterns in the input data and assign each input data point to one of several possible categories. The goal is to find a decision boundary that separates the different categories as well as possible. During training, the algorithm is provided with a set of input variables and their corresponding output labels, which represent the categories to which the input data points belong. The algorithm uses this data to learn the relationship between the input variables and the output labels, and to find the decision boundary that best separates the different categories. Once the algorithm has learned this relationship, it can use it to make predictions on new, unseen data.

Let’s get started.

Background and Context

AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).

A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9% success. This has encouraged the retail marketing department to devise campaigns with better target marketing to increase the success ratio.

We will attempt to build a model that will help the marketing department to identify the potential customers who have a higher probability of purchasing the loan.

Data Dictionary

ID: Customer ID
Age: Customer’s age in completed years
Experience: #years of professional experience
Income: Annual income of the customer (in thousand dollars)
ZIP Code: Home Address ZIP code.
Family: the Family size of the customer
CCAvg: Average spending on credit cards per month (in thousand dollars)
Education: Education Level. 1: Undergrad; 2: Graduate;3: Advanced/Professional
Mortgage: Value of house mortgage if any. (in thousand dollars)
Personal_Loan: Did this customer accept the personal loan offered in the last campaign?
Securities_Account: Does the customer have securities account with the bank?
CD_Account: Does the customer have a certificate of deposit (CD) account with the bank?
Online: Do customers use internet banking facilities?
CreditCard: Does the customer use a credit card issued by any other Bank (excluding All life Bank)?

Methodology

We will start by following the same methodology as we did in our linear regression model:

Data Collection: Begin by collecting a dataset that contains the input features. This dataset will be split into a training set (used to train the model) and a testing set (used to evaluate the model’s performance).
Data Preprocessing: Clean and preprocess the data, addressing any missing values or outliers, and scaling the input features to ensure that they are on the same scale.
Model Training: Train the logistic regression model on the training dataset. This step involves finding the best-fitting line that minimizes the error between the actual and predicted purchase likelihood. Most programming languages, such as Python, R, or MATLAB, have built-in libraries that simplify this process.
Model Evaluation: Evaluate the model’s performance on the testing dataset by comparing its predictions to the actual loan purchases. Common evaluation metrics for classification models include:
1. Accuracy: The proportion of correctly classified instances to the total number of instances in the test set.
2. Precision: The proportion of true positives (correctly classified positive instances) to the total number of predicted positives (instances classified as positive).
3. Recall: The proportion of true positives to the total number of actual positives in the test set.
4. F1 score: The harmonic mean of precision and recall, which provides a balance between the two measures.
5. Area under the receiver operating characteristic curve (AUC-ROC): A measure of the performance of the algorithm at different threshold levels for classification. The AUC-ROC curve plots the true positive rate (recall) against the false positive rate (1-specificity) for different threshold levels.
6. Confusion matrix: A table that summarizes the actual and predicted classifications for each class. It provides information on the true positives, true negatives, false positives, and false negatives.
Model Optimization: If the model’s performance is unsatisfactory, consider feature engineering, adding more data, or using regularization techniques to improve the model’s accuracy.

The dataset used to build this model can be found by visiting my GitHub page.

Data Collection

We will start by importing all our required Python libraries:

	Accuracy	Recall	Precision	F1
	0.959724	0.647416	0.898734	0.75265

	Accuracy	Recall	Precision	F1
	0.915132	0.899696	0.530466	0.667418

	Accuracy	Recall	Precision	F1
	0.956847	0.753799	0.782334	0.767802

	Logistic Regression sklearn	Logistic Regression-0.12 Threshold	Logistic Regression-0.33 Threshold
Accuracy	0.959724	0.915132	0.956847
Recall	0.647416	0.899696	0.753799
Precision	0.898734	0.530466	0.782334
F1	0.752650	0.667418	0.767802

Using Logistic Regression to Predict Personal Loan Purchase: A Classification Approach

Background and Context

Data Dictionary

Methodology

Data Collection

Data Preprocessing, EDA, and Univariate/Multivariate Analysis

Model Building

Model using sklearn

Model Using Optimal Threshold of .12

Model Using Optimal Threshold of .33

Model Using sklearn

Model Using Optimal Threshold of .12

Model Using 0.33 Threshold

Leave a Reply Cancel reply