Telecom Churn Analysis — Cleaning & Transforming Data
Data Provided by Maven Analytics
Part 1 of 3
Link: https://github.com/DaDataGuy/Telecom-Churn-Analysis
Data set: https://www.mavenanalytics.io/data-playground
Getting started is the hardest part, so here's to my first Medium article and to joining the Data Science community. I will be writing a three-part series on my Telecom Churn data set, which was provided by Maven Analytics. This article is part one: Cleaning & Transforming messy data.
The purpose of this article is to walk through how I transformed the data, along with some tricks to help you out. I'll be the first to say I'm not perfect, so if you find any mistakes or have a suggestion, please let me know in the comments.
Importing Libraries & Data Set
Using Jupyter Notebook, I always print the version of Python I’m using. This helps when I need to debug old code or recreate an environment.
import sys
print("Python Version is: " + sys.version)

Output:
Python Version is: 3.9.7 (default, Sep 16 2021, 16:59:28)
[MSC v.1916 64 bit (AMD64)]
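On the same note, recording key library versions helps too. A minimal optional sketch (not part of the original notebook):

import pandas as pd
import numpy as np

# Library versions matter as much as the Python version when recreating an environment
print("pandas:", pd.__version__)
print("numpy:", np.__version__)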
Next, I load the libraries and adjust a few data frame display settings.
# importing libraries
import pandas as pd
import numpy as np
import time

# Ignore all warnings
import warnings
warnings.filterwarnings('ignore')
warnings.filterwarnings(action='ignore', category=DeprecationWarning)

pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', lambda x: '%.2f' % x)
#pd.set_option('display.max_rows', None)
Now I load the data and print how long it takes. I do this for every file as a best practice: if loading is slow, I know exactly which file is the bottleneck.
# Loading dataset labeled "Telecom Customer Churn"
# Link: https://www.mavenanalytics.io/data-playground

time_begin = time.time()
df = pd.read_csv("telecom_customer_churn.csv")
print(f'Run time: {round(((time.time()-time_begin)/60), 3)} mins')

Output: Run time: 0.001 mins
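If you load several files, a small helper keeps the timing consistent. A minimal sketch (load_csv is a hypothetical name, not from the original notebook):

def load_csv(path):
    # Time each file load and report it, so slow files are easy to spot
    t0 = time.time()
    data = pd.read_csv(path)
    print(f'{path} run time: {round((time.time() - t0) / 60, 3)} mins')
    return data

df = load_csv("telecom_customer_churn.csv")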
I’ll review its shape (number of rows by columns).
df.shape
Output: (7043, 37)
and the first two rows of the data frame.
df.head(2)
Review each column to learn its datatype and memory usage. There are 37 columns, with varying non-null counts per column.
df.info()

Output:
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 37 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CustomerID 7043 non-null object
1 Gender 7043 non-null object
2 Age 7043 non-null int64
3 Married 7043 non-null object
4 NumberofDependents 7043 non-null int64
5 City 7043 non-null object
6 ZipCode 7043 non-null int64
7 Population 7043 non-null int64
8 NumberofReferrals 7043 non-null int64
9 TenureinMonths 7043 non-null int64
10 Offer 7043 non-null object
11 PhoneService 7043 non-null object
12 AvgMonthlyLongDistanceCharges 6361 non-null float64
13 MultipleLines 6361 non-null object
14 InternetService 7043 non-null object
15 InternetType 5517 non-null object
16 AvgMonthlyGBDownload 5517 non-null float64
17 OnlineSecurity 5517 non-null object
18 OnlineBackup 5517 non-null object
19 DeviceProtectionPlan 5517 non-null object
20 PremiumTechSupport 5517 non-null object
21 StreamingTV 5517 non-null object
22 StreamingMovies 5517 non-null object
23 StreamingMusic 5517 non-null object
24 UnlimitedData 5517 non-null object
25 Contract 7043 non-null object
26 PaperlessBilling 7043 non-null object
27 PaymentMethod 7043 non-null object
28 MonthlyCharge 7043 non-null float64
29 TotalCharges 7043 non-null float64
30 TotalRefunds 7043 non-null float64
31 TotalExtraDataCharges 7043 non-null int64
32 TotalLongDistanceCharges 7043 non-null float64
33 TotalRevenue 7043 non-null float64
34 Churn 7043 non-null object
35 ChurnCategory 1869 non-null object
36 ChurnReason 1869 non-null object
dtypes: float64(7), int64(7), object(23)
memory usage: 2.0+ MB
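Note that the "2.0+ MB" figure is an estimate; object (text) columns are only fully measured with deep introspection. An optional check:

# deep=True measures the actual string contents of object columns
print(round(df.memory_usage(deep=True).sum() / 1024**2, 1), "MB")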
Let's look at the columns with NaN (null) values. Several columns share the same NaN count, which makes me think the NaNs are tied to a specific value elsewhere in the data frame. HINT: the likely drivers are Phone Service and Internet Service.
df.isnull().sum()
Output: CustomerID 0
Gender 0
Age 0
Married 0
NumberofDependents 0
City 0
ZipCode 0
Population 0
NumberofReferrals 0
TenureinMonths 0
Offer 0
PhoneService 0
AvgMonthlyLongDistanceCharges 682
MultipleLines 682
InternetService 0
InternetType 1526
AvgMonthlyGBDownload 1526
OnlineSecurity 1526
OnlineBackup 1526
DeviceProtectionPlan 1526
PremiumTechSupport 1526
StreamingTV 1526
StreamingMovies 1526
StreamingMusic 1526
UnlimitedData 1526
Contract 0
PaperlessBilling 0
PaymentMethod 0
MonthlyCharge 0
TotalCharges 0
TotalRefunds 0
TotalExtraDataCharges 0
TotalLongDistanceCharges 0
TotalRevenue 0
Churn 0
ChurnCategory 5174
ChurnReason 5174
I will pick one column from each group: MultipleLines for Phone Service and StreamingMusic for Internet Service.
df[df['MultipleLines'].isnull()].head(3)
df[df['StreamingMusic'].isnull()].head(3)

When Internet Service = No, the internet-related columns are NaN because the service isn't available.
I see from the above screenshots that when phone or internet service = No, the related columns are NaN. That makes sense: if a customer doesn't have internet service, their average monthly GB download can't exist.
I can assume the same is true for Churn Category and Churn Reason. If a customer did not churn, they wouldn't have a reason or a category for their churn.
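To confirm the pattern rather than eyeball it, you can compare the service counts to the NaN counts directly. A small sketch, run before any cleaning and assuming the raw labels ('No', 'Churned') used in the replace step below:

print((df['PhoneService'] == 'No').sum())     # expect 682, matching MultipleLines
print((df['InternetService'] == 'No').sum())  # expect 1526, matching InternetType
print((df['Churn'] != 'Churned').sum())       # expect 5174, matching ChurnCategory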
Let’s start the cleaning!
The following code block is large but simple: I'm replacing binary columns with 1 or 0, because machine learning algorithms and logistic regression require numbers rather than text (strings).
# A preference of mine is to replace binary columns with 1/0.
# 'Joined' customers are folded into 'Stayed' first, since they did not churn.
df['Churn'] = df['Churn'].replace('Joined', 'Stayed')
df['Churn'] = df['Churn'].replace(['Stayed','Churned'], [0,1])
df['Married'] = df['Married'].replace(['Yes','No'], [1,0])
df['Gender'] = df['Gender'].replace(['Female','Male'], [0,1])
df['PaperlessBilling'] = df['PaperlessBilling'].replace(['Yes','No'], [1,0])
df['InternetService'] = df['InternetService'].replace(['Yes','No'], [1,0])
df['PhoneService'] = df['PhoneService'].replace(['Yes','No'], [1,0])
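As a side note, the Yes/No columns can also be encoded in a loop with a single mapping dict. A compact alternative sketch (equivalent to the replace calls above, not meant to be run in addition to them):

yes_no = {'Yes': 1, 'No': 0}
for col in ['Married', 'PaperlessBilling', 'InternetService', 'PhoneService']:
    # map() swaps each Yes/No label for its numeric code
    df[col] = df[col].map(yes_no)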
Next, I’m going to fill the columns that are NaN and are numerical columns with 0. You might be wondering why not fill these with the mean per column? Well, the columns are already an average of each feature, so it doesn’t make sense to fill this in if they do not have phone or internet service.
df['AvgMonthlyGBDownload'] = df['AvgMonthlyGBDownload'].fillna(0)
df['AvgMonthlyLongDistanceCharges'] = df['AvgMonthlyLongDistanceCharges'].fillna(0)
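An optional sanity check after the fill, assuming (as the screenshots suggested) that the NaNs belonged only to customers without the service; note the service flags are already 1/0 at this point:

# Every no-phone-service customer should now have 0 long-distance charges
assert (df.loc[df['PhoneService'] == 0, 'AvgMonthlyLongDistanceCharges'] == 0).all()
# Every no-internet-service customer should now have 0 GB downloaded
assert (df.loc[df['InternetService'] == 0, 'AvgMonthlyGBDownload'] == 0).all()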
Lastly, I will fill the NaN values in the remaining text columns (those with more than two unique values) with "Unknown". I won't delete these columns: dropping them would cause too much data loss, and the Unknown values themselves could be valuable to the analysis.
cols_to_change = ['ChurnCategory','ChurnReason','UnlimitedData','StreamingMusic',
                  'StreamingMovies','StreamingTV','PremiumTechSupport',
                  'DeviceProtectionPlan','OnlineSecurity','OnlineBackup',
                  'InternetType','MultipleLines']

df[cols_to_change] = df[cols_to_change].fillna('Unknown')
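If you'd rather not maintain that list by hand, the same columns can be derived programmatically. A sketch of the idea:

# Select text (object) columns that still contain NaN
text_cols_with_nan = df.columns[(df.dtypes == object) & df.isnull().any()]
df[text_cols_with_nan] = df[text_cols_with_nan].fillna('Unknown')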
Checking columns for any NaN values. I want to see zeros across all columns.
df.isnull().sum()
Output:
CustomerID 0
Gender 0
Age 0
Married 0
NumberofDependents 0
City 0
ZipCode 0
Population 0
NumberofReferrals 0
TenureinMonths 0
Offer 0
PhoneService 0
AvgMonthlyLongDistanceCharges 0
MultipleLines 0
InternetService 0
InternetType 0
AvgMonthlyGBDownload 0
OnlineSecurity 0
OnlineBackup 0
DeviceProtectionPlan 0
PremiumTechSupport 0
StreamingTV 0
StreamingMovies 0
StreamingMusic 0
UnlimitedData 0
Contract 0
PaperlessBilling 0
PaymentMethod 0
MonthlyCharge 0
TotalCharges 0
TotalRefunds 0
TotalExtraDataCharges 0
TotalLongDistanceCharges 0
TotalRevenue 0
Churn 0
ChurnCategory 0
ChurnReason 0
Quick Data Exploration
df['Churn'].value_counts()
# 0 = did not churn

Output:
0 5174
1 1869
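To see the class split as percentages instead of raw counts, value_counts accepts a normalize flag:

df['Churn'].value_counts(normalize=True)
# roughly 73% stayed (0) vs 27% churned (1)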
I will remove outliers because they can skew both the machine learning model and the logistic regression analysis. This is not required, but it trims extreme rows that can cause issues further down the road.
I selected Total Revenue and Population as the columns for outlier removal. Total Revenue is correlated with the other charge columns, so filtering on it alone also removes customers with anomalies in those columns. I also selected Population because a customer in an unusually large city has access to competing telecom providers, who might have better pricing or service; it's not worth the trouble of untangling why such a customer left.
# Removing outliers from two columns that encompass the entire data set.
# Note: np.abs keeps rows within +/- 3 standard deviations, trimming both tails.
from scipy import stats

df = df[(np.abs(stats.zscore(df['TotalRevenue'])) < 3)]
df = df[(np.abs(stats.zscore(df['Population'])) < 3)]
How many were removed?
df['Churn'].value_counts()
# 0 = did not churn

Output:
0 5120
1 1861
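The drop per class can be computed directly from the two value_counts outputs. A small sketch using the numbers above:

before = pd.Series({0: 5174, 1: 1869})
after = pd.Series({0: 5120, 1: 1861})
print(before - after)                              # 0: 54, 1: 8 customers removed
print(((before - after) / before * 100).round(2))  # 0: 1.04%, 1: 0.43%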
The outlier removal dropped 54 customers who did not churn (1.04% of that class) and 8 customers who did churn (0.43%).
Not bad. The removal barely dented either class, so the churn distribution is essentially unchanged. You love to see that!
Exporting the cleaned data frame to a new CSV.
df.to_csv('CleanedDF.csv', index = False)
That's it! I apologize that I haven't found an easier way to embed the results for each code block.
Below is my entire Jupyter Notebook file that you can use to follow along. If you have any suggestions, recommendations or want to say thanks, feel free to drop it in the comments section.
I'll be releasing Part 2 soon: predicting a customer's churn probability with the XGBClassifier machine learning algorithm.