Customer Churn Prediction & Probability Machine Learning Model

Da Data Guy
9 min read · Sep 8, 2022


Data Provided by Maven Analytics
Part: 2 of 3
Link: https://github.com/DaDataGuy/Telecom-Churn-Analysis
Telecom Customer Churn: https://www.mavenanalytics.io/data-playground

Has it crossed your mind how you can predict when a customer will leave, or what the probability is of that customer leaving? It’s an important question that can be difficult to answer, but it doesn’t have to be. If you have access to your customer data and have marked who has churned and who has not, it’s possible to create your own model and deliver real results.

The image below is the final result of this churn prediction analysis, and if you’re interested in creating your own, you can!

I suggest having over 10,000 records, but if you don’t, that is okay. Just be aware that your model will be more sensitive to changes in your data. For this walkthrough, I will be using a data set with fewer than 10,000 records.

In my previous blog post, I talked about how to clean and transform your data in preparation for this machine learning model. If you haven’t done that, click the link and follow along. If you have already done that and your data is ready, let’s dive in!

P.S. The links to this data set and my Jupyter Notebook file are listed at the top. Feel free to download and follow along.

My first task is to create a new Jupyter Notebook and start loading in the libraries and the CleanedDF.csv file.

import sys
print("Python Version is: " + sys.version)
Output: Python Version is: 3.9.7 (default, Sep 16 2021, 16:59:28) [MSC v.1916 64 bit (AMD64)]

I use a lot of libraries, and if you don’t have all of them, you can either install them using Anaconda or run “! pip install xgboost” in a cell block. If that is not an option, comment out the libraries you do not have, and you can make edits later.
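If several of these are missing, the lines below should cover the third-party packages in one notebook cell. This is a sketch; the PyPI names are the standard ones, but adjust it to your environment:

! pip install scikit-learn xgboost lightgbm catboost imbalanced-learn
! pip install pandas numpy seaborn matplotlib joblib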

import time
import pandas as pd
import itertools
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import joblib # save/load the trained model
import subprocess
from sklearn import feature_selection
#
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
#
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression,RidgeClassifier,SGDClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, BaggingClassifier
from sklearn.preprocessing import OneHotEncoder , OrdinalEncoder,StandardScaler , MinMaxScaler, MaxAbsScaler
from sklearn.naive_bayes import BernoulliNB
from sklearn.ensemble import VotingClassifier
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier,ExtraTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import RFE
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score, recall_score, confusion_matrix, classification_report,f1_score,accuracy_score
from sklearn.dummy import DummyClassifier
from catboost import CatBoostClassifier
# imbalanced-learn
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as imbpipeline

# Get multiple outputs in the same cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Ignore all warnings
import warnings
warnings.filterwarnings('ignore')
warnings.filterwarnings(action='ignore', category=DeprecationWarning)
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', lambda x: '%.2f' % x) # display floats to two decimal places

Next, I will load the file and print off how long it took to load.

import time

time_begin = time.time()
df = pd.read_csv("CleanedDF.csv")
print(f'Run time: {round(((time.time()-time_begin)/60), 3)} mins')

Output:
Run time: 0.001 mins

After loading all of the libraries and the CleanedDF data frame, I assign the data frame, labeled df, to X and the target variable, Churn, to y.

X = df
y = df['Churn']

I will now create the train and test split of my data frame and the churn variable. If you have done this split before, you might be saying, “Hey, wait! You haven’t dropped the target variable (Churn) from the X data frame.” You are correct, but don’t worry; further down you will see why I did that.

As an additional note, stratify is an option I used to evenly distribute the target variable between y_train and y_test. I also chose to shuffle the data to further randomize the split and help prevent overfitting.

X_train, X_test, y_train, y_test = train_test_split(X, y, 
test_size=0.3,
random_state=42,
stratify=y,
shuffle = True
)

This next step is a manual process that unfortunately needs to happen. We will be splitting out the features as either numerical, categorical, or features to be removed. (If you would rather not maintain these lists by hand, see the sketch after the lists below.)

remove = ['CustomerID','ZipCode','City','ChurnCategory','ChurnReason','Churn']

# Numerical columns
num_feats = [
'Age',
'NumberofDependents',
'Population',
'NumberofReferrals',
'TenureinMonths',
'AvgMonthlyLongDistanceCharges',
'AvgMonthlyGBDownload',
'MonthlyCharge',
'TotalCharges',
'TotalRefunds',
'TotalExtraDataCharges',
'TotalLongDistanceCharges',
'TotalRevenue'
]
# Categorical columns
cat_feats = [
'Gender',
'Offer',
'Married',
'PhoneService',
'MultipleLines',
'InternetService',
'InternetType',
'OnlineSecurity',
'OnlineBackup',
'DeviceProtectionPlan',
'PremiumTechSupport',
'StreamingTV',
'StreamingMovies',
'StreamingMusic',
'UnlimitedData',
'Contract',
'PaperlessBilling',
'PaymentMethod',
]
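As an aside, if you would rather not type these lists by hand, a sketch like the one below can infer them from the data frame’s dtypes. This assumes the numerical columns in CleanedDF.csv are already stored as numeric types; it is an alternative, not what my notebook does.

# Infer the feature lists from dtypes, excluding the columns marked for removal
candidate_cols = [c for c in df.columns if c not in remove]
num_feats = df[candidate_cols].select_dtypes(include='number').columns.tolist()
cat_feats = df[candidate_cols].select_dtypes(exclude='number').columns.tolist()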

I will now create a copy of X_test and then drop the unneeded columns, including the target. The reason for creating the copy before dropping them is that we will later match the churn prediction probabilities back to each CustomerID.

X_testcopy = X_test.copy()

X_test.drop(remove, axis = 1, inplace = True)
X_train.drop(remove, axis = 1, inplace = True)

The next step is creating a function that acts as my pipeline. Pipelines have grown in popularity and allow you to perform a series of data manipulation tasks in a single, repeatable object. I’ve attached a link that I believe explains how to create a pipeline well.

def get_pipeline(X, model):
    numeric_pipeline = SimpleImputer(strategy='mean')
    categorical_pipeline = OneHotEncoder(handle_unknown='ignore')

    preprocessor = ColumnTransformer(
        transformers=[
            ('numeric', numeric_pipeline, num_feats),
            ('categorical', categorical_pipeline, cat_feats),
        ], remainder='passthrough'
    )

    bundled_pipeline = imbpipeline(steps=[
        ('preprocessor', preprocessor),
        ('smote', SMOTE(random_state=42)),
        ('scaler', MinMaxScaler()),
        ('model', model)
    ])

    return bundled_pipeline

PAUSE

Let’s break this down step-by-step.

  1. Create a function that accepts X and model inputs.
  2. Within the function, I take the numerical features I created above and assign them to a transformation technique called SimpleImputer. SimpleImputer handles missing data, and I’ve specified that it replace missing values with the mean of that column across the X_train and X_test data sets.
  3. The same applies to the next step, which uses OneHotEncoder. OneHotEncoder takes the categorical features I listed in “cat_feats” and creates a new column for each unique category. You can learn more by clicking the link above.
  4. My preprocessor takes the two feature lists I’ve created and assigns each a transformation: SimpleImputer for “num_feats” and OneHotEncoder for “cat_feats”.
  5. Finally, I create a pipeline that performs the preprocessor transformations, handles any class imbalance using SMOTE, scales the values within each column to reduce the effect of outliers, and applies my model. (A small demonstration of steps 2 and 3 follows this list.)
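To make steps 2 and 3 concrete, here is a minimal, self-contained sketch showing what the two transformers do in isolation. The toy data frame is hypothetical and not part of the churn data set.

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

toy = pd.DataFrame({'Age': [30, np.nan, 50],
                    'Contract': ['Month-to-Month', 'One Year', 'Month-to-Month']})

# SimpleImputer fills the missing Age with the column mean (40.0)
print(SimpleImputer(strategy='mean').fit_transform(toy[['Age']]))

# OneHotEncoder turns each unique Contract value into its own 0/1 column
print(OneHotEncoder(handle_unknown='ignore').fit_transform(toy[['Contract']]).toarray())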

The next part is going to be long, and honestly, it’s worth your time to learn how it works. The high-level overview: this select_model function takes your X and y data sets and runs them through the pipeline. It loops through the various models that I’ve specified and then returns each one’s accuracy and run time in a data frame. I apologize in advance for how this is formatted. At the end, I will provide the Jupyter Notebook code, which presents the code in each cell block in a cleaner fashion.

def select_model(X, y, pipeline=None):
    classifiers = {}
    classifiers.update({"DummyClassifier": DummyClassifier(strategy='most_frequent')})
    classifiers.update({"XGBClassifier": XGBClassifier(use_label_encoder=False,
                                                       eval_metric='logloss',
                                                       objective='binary:logistic',
                                                       )})
    classifiers.update({"LGBMClassifier": LGBMClassifier()})
    classifiers.update({"RandomForestClassifier": RandomForestClassifier()})
    classifiers.update({"DecisionTreeClassifier": DecisionTreeClassifier()})
    classifiers.update({"ExtraTreeClassifier": ExtraTreeClassifier()})
    classifiers.update({"AdaBoostClassifier": AdaBoostClassifier()})
    classifiers.update({"KNeighborsClassifier": KNeighborsClassifier()})
    classifiers.update({"RidgeClassifier": RidgeClassifier()})
    classifiers.update({"SGDClassifier": SGDClassifier()})
    classifiers.update({"BaggingClassifier": BaggingClassifier()})
    classifiers.update({"BernoulliNB": BernoulliNB()})
    classifiers.update({"SVC": SVC()})
    classifiers.update({"CatBoostClassifier": CatBoostClassifier(silent=True)})

    # Stacking (VotingClassifier ensembles)
    models = []
    models.append(('XGBClassifier', XGBClassifier(use_label_encoder=False, eval_metric='logloss', objective='binary:logistic')))
    models.append(('CatBoostClassifier', CatBoostClassifier(silent=True)))
    models.append(('BaggingClassifier', BaggingClassifier()))
    classifiers.update({"VotingClassifier (XGBClassifier, CatBoostClassifier, BaggingClassifier)": VotingClassifier(models)})

    models = []
    models.append(('XGBClassifier', XGBClassifier(use_label_encoder=False, eval_metric='logloss', objective='binary:logistic')))
    models.append(('LGBMClassifier', LGBMClassifier()))
    models.append(('CatBoostClassifier', CatBoostClassifier(silent=True)))
    classifiers.update({"VotingClassifier (XGBClassifier, LGBMClassifier, CatBoostClassifier)": VotingClassifier(models)})

    models = []
    models.append(('XGBClassifier', XGBClassifier(use_label_encoder=False, eval_metric='logloss', objective='binary:logistic')))
    models.append(('RandomForestClassifier', RandomForestClassifier()))
    models.append(('DecisionTreeClassifier', DecisionTreeClassifier()))
    classifiers.update({"VotingClassifier (XGBClassifier, RandomForestClassifier, DecisionTreeClassifier)": VotingClassifier(models)})

    models = []
    models.append(('XGBClassifier', XGBClassifier(use_label_encoder=False, eval_metric='logloss', objective='binary:logistic')))
    models.append(('AdaBoostClassifier', AdaBoostClassifier()))
    classifiers.update({"VotingClassifier (XGBClassifier, AdaBoostClassifier)": VotingClassifier(models)})

    models = []
    models.append(('XGBClassifier', XGBClassifier(use_label_encoder=False, eval_metric='logloss', objective='binary:logistic')))
    classifiers.update({"VotingClassifier (XGBClassifier)": VotingClassifier(models)})

    # Score each classifier with 5-fold cross-validation and record its run time
    rows = []
    for key in classifiers:
        start_time = time.time()
        pipeline = get_pipeline(X_train, classifiers[key])
        cv = cross_val_score(pipeline, X, y, cv=5, scoring='accuracy')
        rows.append({'model': key,
                     'run_time': format(round((time.time() - start_time)/60, 2)),
                     'accuracy': cv.mean(),
                     })

    df_models = pd.DataFrame(rows, columns=['model', 'run_time', 'accuracy'])
    df_models = df_models.sort_values(by='accuracy', ascending=False)
    return df_models

Let’s see how it did!

models = select_model(X_train, y_train)
models.sort_values(by=['accuracy','run_time'], ascending=False)

I can see the voting classifier, which stacks several algorithms on top of one another, performed the best. However, it also took the longest to run, and if I were to apply this same logic to a larger data set, I would want to choose another model.

I’m selecting the XGBClassifier algorithm because it offers the best combination of high accuracy and low run time. I will now build my initial model before performing any hyperparameter tuning. You may see other pipelines wrap the model in GridSearchCV to perform hyperparameter tuning, and that is a good habit to get into. I did not do that here because I wanted to keep this vanilla.
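That said, if you want to see roughly what that would look like with this pipeline, here is a hedged sketch. The parameter values in the grid are illustrative assumptions, not tuned settings from this post, and pipeline step parameters take the model__ prefix because the model step is named 'model' in get_pipeline.

# Illustrative grid only; a real search would explore more values
param_grid = {
    'model__max_depth': [3, 6, 9],
    'model__learning_rate': [0.05, 0.1],
}

grid_pipeline = get_pipeline(X_train, XGBClassifier(use_label_encoder=False,
                                                    eval_metric='logloss',
                                                    objective='binary:logistic'))
search = GridSearchCV(grid_pipeline, param_grid, cv=5, scoring='accuracy')
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)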

Building The Base Model to Test

I’m now going to:

  1. Build the XGBClassifier
  2. Use the get_pipeline function to transform my data set
  3. Fit the pipeline on X_train and y_train
  4. Use the bundled pipeline to predict on the held-out X_test data set, so the model is evaluated on unseen data

basemodel = XGBClassifier(use_label_encoder = False, eval_metric='logloss', objective='binary:logistic')
bundled_pipeline = get_pipeline(X_train, basemodel)
bundled_pipeline.fit(X_train, y_train)
basemodel_y_pred = bundled_pipeline.predict(X_test)

After this has been completed, I will print the classification report and confusion matrix.

print(classification_report(y_test, basemodel_y_pred))
print(confusion_matrix(y_test, basemodel_y_pred))

Confusion Matrix:

[[1384  153]
 [ 185  373]]
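If you prefer a visual, a quick seaborn heatmap makes the matrix easier to read. This is a sketch using the seaborn import from earlier; the axis labels assume the 0 = stayed / 1 = churned encoding.

# Plot the confusion matrix as an annotated heatmap
cm = confusion_matrix(y_test, basemodel_y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Stayed', 'Churned'],
            yticklabels=['Stayed', 'Churned'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()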

Classification Report:

It’s the final countdown!

Putting it all together, I will also add the model’s predictions. Most tutorials skip this part or do not show how to extract the output of the model’s predictions. In my opinion, being able to share the model’s predictions and then analyze the data further is key to understanding the classifications that were created.

time_begin = time.time() # starts timer

# Load the model with its hyperparameters
model = XGBClassifier(
    use_label_encoder = False, eval_metric='logloss', objective='binary:logistic', learning_rate = 0.1,
    n_estimators = 1000, max_depth = 9, min_child_weight = 1, gamma = 0.4, colsample_bytree = 0.8,
    subsample = 0.9, reg_alpha = 1, scale_pos_weight = 1)

model = get_pipeline(X_train, model)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Predict target probabilities
test_prob = model.predict_proba(X_test)[:,1]
test_pred = np.where(test_prob > 0.45, 1, 0) # sets the probability threshold and can be tweaked

# Test set metrics
roc_auc_score(y_test, test_pred)
recall_score(y_test, test_pred)
confusion_matrix(y_test, test_pred)
print(classification_report(y_test, test_pred))
print(f'Run time: {round(((time.time()-time_begin)/60), 3)} mins')

# Adding predictions and their probabilities to the original test data frame
X_testcopy['predictions'] = test_pred
X_testcopy['pred_probabilities'] = test_prob

# Building the high_churn_list data frame, sorted by churn probability (exported to CSV below)
high_churn_list = X_testcopy[X_testcopy.pred_probabilities > 0.0].sort_values(by=['pred_probabilities'], ascending = False).reset_index().drop(columns=['index'], axis=1)
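The 0.45 threshold above was chosen manually. As a quick sketch (not part of my original notebook), you could sweep a range of thresholds and compare F1 scores to pick one that matches your tolerance for false positives versus false negatives:

# Compare F1 scores across candidate probability thresholds
for threshold in np.arange(0.30, 0.70, 0.05):
    preds = np.where(test_prob > threshold, 1, 0)
    print(f'threshold={threshold:.2f}  f1={f1_score(y_test, preds):.3f}')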

Reviewing the updated classification report and confusion matrix:

[[1378  159]
 [ 179  379]]

Export the results:

high_churn_list.to_csv('high_churn_list_model.csv', index = False)
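Since joblib was imported at the top for saving and loading models, here is a minimal sketch of persisting the fitted pipeline so you can score new customers later without retraining. The file name is an assumption.

# Save the fitted pipeline, then reload it for future predictions
joblib.dump(model, 'churn_pipeline.joblib')
loaded_model = joblib.load('churn_pipeline.joblib')
loaded_model.predict_proba(X_test)[:,1]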

The model is complete! Now what?

Let’s break down the confusion matrix and learn how it can help me or the business.

The model predicted:

  1. 1,378 customers who did not churn and were correctly predicted (true negatives)
  2. 179 customers who churned but were predicted not to (false negatives)
  3. 159 customers who did not churn but were predicted to churn (false positives)
  4. 379 customers who did churn and were correctly predicted (true positives)

With this information, I can create a marketing program to try to keep the 159 customers the model flagged as likely churners, as they have the highest probability of leaving. The model found their characteristics align with those who have churned, and I can try to prevent that from happening. Below is a screenshot of the top 12 customers who are most likely to churn.
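If you’re following along without the screenshot, the same view comes straight from the data frame built above:

high_churn_list.head(12)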

This concludes my Customer Churn Prediction Model using Python and scikit-learn. Thank you for making it to the end, and if you have any comments or suggestions, or want to say thanks, leave a comment below.

Have a great day! :)
