Building a churn model to understand why customers are leaving.
Key Takeaways
1. How to build a customer churn model using various algorithms.
2. How to select the best-performing model and identify the important features driving its predictions.
Introduction
Customer churn, also known as customer turnover or defection, refers to the loss of customers from a company's customer base. It is a major concern for businesses because it directly impacts revenue and profitability.
Data science algorithms, such as support vector machines (SVM), logistic regression, artificial neural networks, random forests, etc., can be used to predict and prevent customer churn.
Nevertheless, one thing holds true for all industries:
It costs more to acquire new customers than to retain existing ones.
Because acquiring customers is so expensive, it is wise to work hard at retaining them.
Methodology:
To effectively use the above-mentioned algorithms to prevent customer churn, a company should first gather data about its past churners and non-churners. This may include information about their demographics, purchasing history, and interactions with the company (e.g. customer service inquiries).
The company should then split the data into a training set and a test set, and use the training set to train one or more machine learning models. The models can then be tested on the test set to evaluate their performance and identify any areas for improvement.
Once the company has a trained and tested machine learning model, it can use the model to predict which customers are at risk of churning and take preventive action. This may include offering incentives to customers to encourage them to remain with the company, addressing any issues or concerns they may have, or proactively reaching out to at-risk customers to see how the company can better meet their needs.
By taking these steps, a company can significantly reduce its customer churn rate and improve its bottom line.
You can download the complete code from my GitHub Repo
Let's get our hands dirty!!
So, let’s say we have a dataset containing data about 10,000 customers of a bank, some of whom have closed their accounts. The data describe attributes of these customers, such as the country they live in, their credit score, age, and balance, among others.
Our model should predict whether a customer will churn or not, so our target variable will be 'exited'. We should analyze the data, focusing on how the different features relate to customer churn.
But first, let’s see how many customers have churned.
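A quick way to check this, as a minimal sketch (assuming the data is loaded into a pandas DataFrame with the target column 'exited'; the file name here is a placeholder):

```python
import pandas as pd

# Placeholder file name; 'exited' is the binary target described above.
df = pd.read_csv("churn_data.csv")

churn_rate = df["exited"].mean()  # proportion of customers with exited == 1
print(f"Churn rate: {churn_rate:.2%}")
print(df["exited"].value_counts())
```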
We can observe that 20.37% of the customers have churned.
This information is valuable because, for classification models, we need to confirm that our dataset does not suffer from class imbalance, i.e., an unequal distribution of classes within the dataset.
Even though the classes are not equally distributed, we can say the dataset does not suffer from severe class imbalance. After that, we can analyze the relationship between the categorical variables and the target variable.
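One way to produce these plots is with seaborn's countplot; the categorical column names used here ('geography', 'gender', 'isactivemember') are assumptions based on the dataset description:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Assumed categorical column names.
categorical = ["geography", "gender", "isactivemember"]

fig, axes = plt.subplots(1, len(categorical), figsize=(15, 4))
for ax, col in zip(axes, categorical):
    sns.countplot(data=df, x=col, hue="exited", ax=ax)  # counts split by churn
    ax.set_title(f"Churn by {col}")
plt.tight_layout()
plt.show()
```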
From the plots, we learn several things:
The proportion of female customers that churn is greater than the proportion of male customers.
Most of the customers come from France. Nevertheless, Germany and Spain have a greater proportion of customers churning.
The proportion of inactive members that churn is higher than the proportion of active members.
Now, we can focus on the relationship between continuous variables and the target variable.
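A sketch of how these plots can be drawn, using boxplots split by churn status (the continuous column names are assumptions):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Assumed continuous column names.
continuous = ["creditscore", "age", "tenure", "balance",
              "numofproducts", "estimatedsalary"]

fig, axes = plt.subplots(2, 3, figsize=(15, 8))
for ax, col in zip(axes.ravel(), continuous):
    sns.boxplot(data=df, x="exited", y=col, ax=ax)  # distribution by churn status
plt.tight_layout()
plt.show()
```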
From the plots, we learn several things:
Customers that churn are older than those who are retained.
There is no difference in the median credit score or tenure between lost and retained customers.
Among the customers that churn, most of them seem to still have a significant balance in their bank account.
Neither estimated salary nor the number of products seems to have an effect on customer churn.
Feature Engineering (Time to create new features)
Some of the variables we have can be combined into new features that describe customers better.
1. creditscore_age_ratio:
We saw before that credit score alone had no effect on churning. Because credit score frequently increases with time (and consequently with age), we’ll create a new feature to account for credit score behaviour by age.
The customers who are churning appear to have a smaller credit score by age ratio.
2. balance_salary_ratio:
We have seen that the estimated salary has no effect on the likelihood of a customer churning.
However, a feature that could be interesting to explore is the ratio between balance and salary because this can be an estimation of which percentage of their salary a customer spends, and could be a probable indicator of churning.
We have created new variables from age, credit score, and balance. As a consequence, we are going to exclude those raw columns from the analysis, since they would be correlated with our new features.
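A minimal sketch of both steps (the column names are assumptions):

```python
# Engineered ratio features; column names are assumed.
df["creditscore_age_ratio"] = df["creditscore"] / df["age"]
# Assumes estimatedsalary is never zero in this data.
df["balance_salary_ratio"] = df["balance"] / df["estimatedsalary"]

# Drop the raw columns the ratios were built from, to avoid keeping
# features strongly correlated with the new ones.
df = df.drop(columns=["creditscore", "age", "balance"])
```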
3. Encode categorical variables:
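A minimal sketch, assuming 'geography' and 'gender' are the categorical columns:

```python
import pandas as pd

# One-hot encode the categorical variables; drop_first avoids a
# redundant dummy column.
df = pd.get_dummies(df, columns=["geography", "gender"], drop_first=True)
```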
Splitting dataset
In order to train and test our model, we need to split our dataset into two sub-datasets: the training and the test dataset.
It is common to use an 80%-20% split of the original dataset.
It is important to use a reliable method to split the dataset to avoid data leakage, i.e., the presence in the test set of examples that were also used for training, which makes performance estimates overly optimistic and can hide overfitting.
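A sketch of the split; stratifying on the target keeps the roughly 20% churn rate in both subsets, and the fixed random_state makes it reproducible:

```python
from sklearn.model_selection import train_test_split

X = df.drop(columns=["exited"])
y = df["exited"]

# 80/20 split, stratified on the target.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```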
Model Building (My fav part)
We are ready to build different models looking for the best fit. We’ll test:
Logistic regression classifier
Support Vector Machine with Radial basis function kernel
Random Forest
For each of these models we are going to follow these steps:
Parameter Search: We will define the parameters and values to search for each model. Then, we will run GridSearchCV and set the best parameters obtained in our model (see the sketch after this list).
Best Model Fit: After finding the best estimator, we will train it using the training dataset.
Performance Evaluation: After training the best models with our training dataset, we are going to see how well they perform using our test set.
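Here is a minimal sketch of these three steps, using the first model as the example (the grid values and the F1 scoring choice are illustrative assumptions):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# 1. Parameter search (grid values are assumptions).
param_grid = {"C": [0.01, 0.1, 1, 10]}
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)

# 2. Best-model fit: GridSearchCV refits the best estimator on the
#    whole training set by default (refit=True).
best_logreg = search.best_estimator_
print(search.best_params_)

# 3. Performance evaluation on the held-out test set.
print(best_logreg.score(X_test, y_test))
```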
1. Logistic Regression: Logistic regression is a supervised machine learning algorithm that is often used for classification tasks. It creates a model that explains how independent variables contribute to a binary dependent variable.
In other words, it defines an equation with the features that are thought to impact churn and tries to estimate the best coefficient for each variable for customers who did or did not churn.
When we calculate the accuracy of the model, it comes out to be 81%. However, recall and f1-score are around 50% and there are a lot of false negatives, which is a situation we should avoid.
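A sketch of how these metrics can be computed on the test set:

```python
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)

y_pred = best_logreg.predict(X_test)
print(accuracy_score(y_test, y_pred))
# In the confusion matrix, the bottom-left cell counts false negatives:
# churners that the model predicted as staying.
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```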
2. Support Vector Machines: This algorithm creates an n-dimensional space (n being the number of features used), with each customer represented as a point in that space.
In order to classify the points into one of two groups, the customer churning or not, it tries to find a hyperplane that separates the two groups with as large a margin as possible.
The SVM model generates a prediction for each data point and predicts whether the customer is in the churn group or not.
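A sketch of the SVM search; SVMs are sensitive to feature scale, so standardization is done inside a pipeline (the grid values are assumptions):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# RBF-kernel SVM with scaling; grid values are illustrative.
svm_search = GridSearchCV(
    make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.1]},
    cv=5, scoring="f1",
)
svm_search.fit(X_train, y_train)
best_svm = svm_search.best_estimator_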
When we calculate the accuracy of the model, it comes out to be 83%. The recall and f1-score have improved from 50% to 64%. There are still a lot of false negatives.
3. Random Forests: Random Forest is an ensemble-based algorithm. It contains a large number of randomized decision trees.
Each of these decision trees will classify a point, and in the end, the majority vote of the decisions reached by all the trees is taken. This technique helps avoid overfitting.
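A sketch of the random forest search (grid values are assumptions):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Random forest with a small illustrative grid.
rf_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {"n_estimators": [100, 300], "max_depth": [6, 8, None]},
    cv=5, scoring="f1",
)
rf_search.fit(X_train, y_train)
best_rf = rf_search.best_estimator_
```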
When we calculate the accuracy of the model, it comes out to be 84%. The recall and f1-score have improved a little more to 66%. There are still a lot of false negatives.
Now, which algorithm to select? Hmmm...
This can be done by comparing the ROC (receiver operating characteristic) curves of the 3 models.
The ROC curve lets us investigate the trade-off between the false positive rate and the true positive rate, and from it we can calculate the AUC (area under the curve), which is also a metric of the predictive power of our model.
The closer the AUC is to 1, the better our model is at separating a random sample into the two classes.
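A sketch of the comparison, assuming a recent scikit-learn (RocCurveDisplay plots each curve and reports its AUC; for the SVM it falls back to the decision function):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import RocCurveDisplay

fig, ax = plt.subplots()
for name, model in [("Logistic Regression", best_logreg),
                    ("SVM (RBF)", best_svm),
                    ("Random Forest", best_rf)]:
    # Draws the ROC curve with the AUC in the legend.
    RocCurveDisplay.from_estimator(model, X_test, y_test, name=name, ax=ax)
plt.show()
```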
Comparing the AUC value for the three models, we see that the Random Forest performs better than the logistic regression model.
Feature Importance
We are going to analyze how the different features affect customer churn. For that, we are going to check the variable importance, that is, quantify how useful each variable is for our model.
This is an important analysis: we can identify the features that make a customer more likely to churn and plan targeted strategies. To learn more about feature importance, check out my Responsible AI blog.
For the SVM with an RBF kernel, it is not possible to get feature importances directly, as the model works like a black box.
For Logistic Regression, we can use the function SelectFromModel from sklearn.feature_selection. This function will select the features based on importance weights.
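A sketch of that selection, reusing the already-fitted model (note that coefficient magnitudes are only comparable across features on similar scales):

```python
from sklearn.feature_selection import SelectFromModel

# prefit=True reuses the fitted logistic regression instead of refitting.
selector = SelectFromModel(best_logreg, prefit=True)
selected = X_train.columns[selector.get_support()]
print(list(selected))
```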
We can see that gender, number of products, member status, and the credit score by age ratio are the four most important features to predict if a customer will churn.
On the other hand, getting the feature importance from a Random Forest is easy: the scikit-learn implementation exposes the attribute .feature_importances_, which informs us of the importance of each feature.
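A sketch of reading and plotting these impurity-based importances:

```python
import matplotlib.pyplot as plt
import pandas as pd

# One importance score per feature, summing to 1.
importances = pd.Series(best_rf.feature_importances_, index=X_train.columns)
importances.sort_values().plot(kind="barh")
plt.title("Random Forest feature importances")
plt.show()
```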
We can see that number of products, credit score by age ratio and member status are again the most important features to predict if a customer will churn. Also, the balance salary ratio is important in this case.
Conclusion:
The models show some room for improvement. Because we want to detect the customers that will churn, we should avoid false negatives.
From the feature importance, we can observe that number of products and credit score by age ratio are important for both models.
The fact that active members are leaving is worrying for the company and should raise an alert.