Telecom Churn Analysis — Cleaning & Transforming Data
Data Provided by Maven Analytics
Part 1 of 3
Link: https://github.com/DaDataGuy/Telecom-Churn-Analysis
Data set: https://www.mavenanalytics.io/data-playground
Getting started is the hardest part, so here's to my first Medium article and to joining the Data Science community. I will be writing a three-part series on my Telecom Churn data set, which was provided by Maven Analytics. This article is part one: Cleaning & Transforming messy data.
The purpose of this article is to walk through how I transformed the data, along with some tricks to help you out. I'll be the first to say I'm not perfect, so if you find any mistakes or have a suggestion, please let me know in the comments.
Importing Libraries & Data Set
Using Jupyter Notebook, I always print the version of Python I’m using. This helps when I need to debug old code or recreate an environment.
import sys
print("Python Version is: " + sys.version)

Output:
Python Version is: 3.9.7 (default, Sep 16 2021, 16:59:28)
[MSC v.1916 64 bit (AMD64)]
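On the same note, recording key library versions helps too. A minimal optional sketch (not part of the original notebook):

import pandas as pd
import numpy as np

# Library versions matter as much as the Python version when recreating an environment
print("pandas:", pd.__version__)
print("numpy:", np.__version__)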
Next, I load the libraries and adjust a few data frame display settings.
# importing libraries
import pandas as pd
import numpy as np
import time

# Ignore all warnings
import warnings
warnings.filterwarnings('ignore')
warnings.filterwarnings(action='ignore', category=DeprecationWarning)

pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', lambda x: '%.2f' % x)
#pd.set_option('display.max_rows', None)
Now I load the data and print how long it takes. I do this for every file as a best practice: if loading is slow, I know exactly which file is the bottleneck.
# Loading dataset labeled "Telecom Customer Churn"
# Link: https://www.mavenanalytics.io/data-playground

time_begin = time.time()
df = pd.read_csv("telecom_customer_churn.csv")
print(f'Run time: {round(((time.time()-time_begin)/60), 3)} mins')

Output: Run time: 0.001 mins
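If you load several files, a small helper keeps the timing consistent. A minimal sketch (load_csv is a hypothetical name, not from the original notebook):

def load_csv(path):
    # Time each file load and report it, so slow files are easy to spot
    t0 = time.time()
    data = pd.read_csv(path)
    print(f'{path} run time: {round((time.time() - t0) / 60, 3)} mins')
    return data

df = load_csv("telecom_customer_churn.csv")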
I’ll review its shape (number of rows by columns).
df.shape
Output: (7043, 37)
and the first two rows of the data frame.
df.head(2)
Review each column to learn its datatype and memory usage. There are 37 columns, with varying non-null counts per column.
df.info()

Output:
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 37 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CustomerID 7043 non-null object
1 Gender 7043 non-null object
2 Age 7043 non-null int64
3 Married 7043 non-null object
4 NumberofDependents 7043 non-null int64
5 City 7043 non-null object
6 ZipCode 7043 non-null int64
7 Population 7043 non-null int64
8 NumberofReferrals 7043 non-null int64
9 TenureinMonths 7043 non-null int64
10 Offer 7043 non-null object
11 PhoneService 7043 non-null object
12 AvgMonthlyLongDistanceCharges 6361 non-null float64
13 MultipleLines 6361 non-null object
14 InternetService 7043 non-null object
15 InternetType 5517 non-null object
16 AvgMonthlyGBDownload 5517 non-null float64
17 OnlineSecurity 5517 non-null object
18 OnlineBackup 5517 non-null object
19 DeviceProtectionPlan 5517 non-null object
20 PremiumTechSupport 5517 non-null object
21 StreamingTV 5517 non-null object
22 StreamingMovies 5517 non-null object
23 StreamingMusic 5517 non-null object
24 UnlimitedData 5517 non-null object
25 Contract 7043 non-null object
26 PaperlessBilling 7043 non-null object
27 PaymentMethod 7043 non-null object
28 MonthlyCharge 7043 non-null float64
29 TotalCharges 7043 non-null float64
30 TotalRefunds 7043 non-null float64
31 TotalExtraDataCharges 7043 non-null int64
32 TotalLongDistanceCharges 7043 non-null float64
33 TotalRevenue 7043 non-null float64
34 Churn 7043 non-null object
35 ChurnCategory 1869 non-null object
36 ChurnReason 1869 non-null object
dtypes: float64(7), int64(7), object(23)
memory usage: 2.0+ MB
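Note that the "2.0+ MB" figure is an estimate; object (text) columns are only fully measured with deep introspection. An optional check:

# deep=True measures the actual string contents of object columns
print(round(df.memory_usage(deep=True).sum() / 1024**2, 1), "MB")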
Let's look at the columns with NaN (null) values. Several columns share the same NaN count, which makes me think the NaNs are tied to a specific value elsewhere in the data frame. HINT: the likely drivers are Phone Service and Internet Service.
df.isnull().sum()
Output: CustomerID 0
Gender 0
Age 0
Married 0
NumberofDependents 0
City 0
ZipCode 0
Population 0
NumberofReferrals 0
TenureinMonths 0
Offer 0
PhoneService 0
AvgMonthlyLongDistanceCharges 682
MultipleLines 682
InternetService 0
InternetType 1526
AvgMonthlyGBDownload 1526
OnlineSecurity 1526
OnlineBackup 1526
DeviceProtectionPlan 1526
PremiumTechSupport 1526
StreamingTV 1526
StreamingMovies 1526
StreamingMusic 1526
UnlimitedData 1526
Contract 0
PaperlessBilling 0
PaymentMethod 0
MonthlyCharge 0
TotalCharges 0
TotalRefunds 0
TotalExtraDataCharges 0
TotalLongDistanceCharges 0
TotalRevenue 0
Churn 0
ChurnCategory 5174
ChurnReason 5174
I will pick one column from each group: MultipleLines for Phone Service and StreamingMusic for Internet Service.
df[df['MultipleLines'].isnull()].head(3)
df[df['StreamingMusic'].isnull()].head(3)

When Internet Service = No, the internet-related columns are NaN because the service isn't available.
I see from the above screenshots that when phone or internet service = No, the related columns are NaN. That makes sense: if a customer doesn't have internet service, their average monthly GB download can't exist.
I can assume the same is true for Churn Category and Churn Reason. If a customer did not churn, they wouldn't have a reason or a category for their churn.
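To confirm the pattern rather than eyeball it, you can compare the service counts to the NaN counts directly. A small sketch, run before any cleaning and assuming the raw labels ('No', 'Churned') used in the replace step below:

print((df['PhoneService'] == 'No').sum())     # expect 682, matching MultipleLines
print((df['InternetService'] == 'No').sum())  # expect 1526, matching InternetType
print((df['Churn'] != 'Churned').sum())       # expect 5174, matching ChurnCategory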
Let’s start the cleaning!
The following code block is large but simple: I'm replacing binary columns with 1 or 0, because machine learning algorithms and logistic regression require numbers rather than text (strings).
# A preference of mine is to replace binary columns with 1/0.
# 'Joined' customers are folded into 'Stayed' first, since they did not churn.
df['Churn'] = df['Churn'].replace('Joined', 'Stayed')
df['Churn'] = df['Churn'].replace(['Stayed','Churned'], [0,1])
df['Married'] = df['Married'].replace(['Yes','No'], [1,0])
df['Gender'] = df['Gender'].replace(['Female','Male'], [0,1])
df['PaperlessBilling'] = df['PaperlessBilling'].replace(['Yes','No'], [1,0])
df['InternetService'] = df['InternetService'].replace(['Yes','No'], [1,0])
df['PhoneService'] = df['PhoneService'].replace(['Yes','No'], [1,0])
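As a side note, the Yes/No columns can also be encoded in a loop with a single mapping dict. A compact alternative sketch (equivalent to the replace calls above, not meant to be run in addition to them):

yes_no = {'Yes': 1, 'No': 0}
for col in ['Married', 'PaperlessBilling', 'InternetService', 'PhoneService']:
    # map() swaps each Yes/No label for its numeric code
    df[col] = df[col].map(yes_no)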
Next, I’m going to fill the columns that are NaN and are numerical columns with 0. You might be wondering why not fill these with the mean per column? Well, the columns are already an average of each feature, so it doesn’t make sense to fill this in if they do not have phone or internet service.
df['AvgMonthlyGBDownload'] = df['AvgMonthlyGBDownload'].fillna(0)
df['AvgMonthlyLongDistanceCharges'] = df['AvgMonthlyLongDistanceCharges'].fillna(0)
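An optional sanity check after the fill, assuming (as the screenshots suggested) that the NaNs belonged only to customers without the service; note the service flags are already 1/0 at this point:

# Every no-phone-service customer should now have 0 long-distance charges
assert (df.loc[df['PhoneService'] == 0, 'AvgMonthlyLongDistanceCharges'] == 0).all()
# Every no-internet-service customer should now have 0 GB downloaded
assert (df.loc[df['InternetService'] == 0, 'AvgMonthlyGBDownload'] == 0).all()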
Lastly, I will fill the NaN values in the remaining text columns (those with more than two unique values) with "Unknown". I won't delete these columns: dropping them would cause too much data loss, and the Unknown values themselves could be valuable to the analysis.
cols_to_change = ['ChurnCategory','ChurnReason','UnlimitedData','StreamingMusic',
                  'StreamingMovies','StreamingTV','PremiumTechSupport',
                  'DeviceProtectionPlan','OnlineSecurity','OnlineBackup',
                  'InternetType','MultipleLines']

df[cols_to_change] = df[cols_to_change].fillna('Unknown')
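If you'd rather not maintain that list by hand, the same columns can be derived programmatically. A sketch of the idea:

# Select text (object) columns that still contain NaN
text_cols_with_nan = df.columns[(df.dtypes == object) & df.isnull().any()]
df[text_cols_with_nan] = df[text_cols_with_nan].fillna('Unknown')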
Checking columns for any NaN values. I want to see zeros across all columns.
df.isnull().sum()
Output:
CustomerID 0
Gender 0
Age 0
Married 0
NumberofDependents 0
City 0
ZipCode 0
Population 0
NumberofReferrals 0
TenureinMonths 0
Offer 0
PhoneService 0
AvgMonthlyLongDistanceCharges 0
MultipleLines 0
InternetService 0
InternetType 0
AvgMonthlyGBDownload 0
OnlineSecurity 0
OnlineBackup 0
DeviceProtectionPlan 0
PremiumTechSupport 0
StreamingTV 0
StreamingMovies 0
StreamingMusic 0
UnlimitedData 0
Contract 0
PaperlessBilling 0
PaymentMethod 0
MonthlyCharge 0
TotalCharges 0
TotalRefunds 0
TotalExtraDataCharges 0
TotalLongDistanceCharges 0
TotalRevenue 0
Churn 0
ChurnCategory 0
ChurnReason 0
Quick Data Exploration
df['Churn'].value_counts()
# 0 = did not churn

Output:
0 5174
1 1869
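To see the class split as percentages instead of raw counts, value_counts accepts a normalize flag:

df['Churn'].value_counts(normalize=True)
# roughly 73% stayed (0) vs 27% churned (1)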
I will remove outliers because they can skew both the machine learning model and the logistic regression analysis. This is not required, but it trims extreme rows that can cause issues further down the road.
I selected Total Revenue and Population as the columns for outlier removal. Total Revenue is correlated with the other charge columns, so filtering on it alone also removes customers with anomalies in those columns. I also selected Population because a customer in an unusually large city has access to competing telecom providers, who might have better pricing or service; it's not worth the trouble of untangling why such a customer left.
# Removing outliers from two columns that encompass the entire data set.
# Note: np.abs keeps rows within +/- 3 standard deviations, trimming both tails.
from scipy import stats

df = df[(np.abs(stats.zscore(df['TotalRevenue'])) < 3)]
df = df[(np.abs(stats.zscore(df['Population'])) < 3)]
How many were removed?
df['Churn'].value_counts()
# 0 = did not churn

Output:
0 5120
1 1861
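The drop per class can be computed directly from the two value_counts outputs. A small sketch using the numbers above:

before = pd.Series({0: 5174, 1: 1869})
after = pd.Series({0: 5120, 1: 1861})
print(before - after)                              # 0: 54, 1: 8 customers removed
print(((before - after) / before * 100).round(2))  # 0: 1.04%, 1: 0.43%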
The outlier removal dropped 54 customers who did not churn (1.04% of that class) and 8 customers who did churn (0.43%).
Not bad. The removal barely dented either class, so the churn distribution is essentially unchanged. You love to see that!
Exporting the cleaned data frame to a new CSV.
df.to_csv('CleanedDF.csv', index = False)
That's it! I apologize that I haven't found an easier way to embed the results for each code block.
Below is my entire Jupyter Notebook file that you can use to follow along. If you have any suggestions, recommendations or want to say thanks, feel free to drop it in the comments section.
I'll be releasing Part 2 soon: predicting a customer's churn probability with the XGBClassifier machine learning algorithm.