Logistic Regression Using StatsModels
Hi, I’m the Da Data Guy!
If you have been following along, I’ve created three articles about my Customer Churn Analysis using the Maven Analytics dataset. If you haven’t seen my previous articles on this analysis, you can find the data and my step-by-step tutorial here:
Part 1 — Cleaning and Transforming the dataset
Part 2 — Prediction and Probability
For my final analysis, I’ll be using logistic regression from the statsmodels library (imported as statsmodels.api). If you’ve programmed in R, its formula syntax and model summaries will feel familiar.
Before we dig into the code: I’ve already combed through the features and selected those that have a positive coefficient and a p value < 0.05, which keeps the post short and direct. My advice is to group features by their data types and not to use too many features in one logistic regression model. You may notice that as you add or remove features, your coefficients will fluctuate.
Let’s begin!
My first step is loading the libraries and the data set.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from patsy import dmatrices
#SNS Settings
sns.set(color_codes = True)
sns.set(style="whitegrid")
sns.set(rc={'figure.figsize':(10,10)})
sns.set_palette("Set3")
# Ignore all warnings
import warnings
warnings.filterwarnings('ignore')
warnings.filterwarnings(action='ignore', category=DeprecationWarning)
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', lambda x: '%.2f' % x)
# Import Data Set
import time
time_begin = time.time()
df = pd.read_csv("high_churn_list_model.csv") # data = pd.read_csv("census.csv")
print(f'Run time: {round(((time.time()-time_begin)/60), 3)} mins')
To verify it works, I’ve checked a sample of the data set and the customer churn count.
# Creating a sample
df.sample(2)
# Checking to see how many have churned
df['Churn'].value_counts()
# 0 = did not exit
Next, we will assign our target variable (churn) and then run it against features that are similar in their data types.
# First group
y,X = dmatrices('Churn ~ Age + MonthlyCharge + np.log(Population)', data = df, return_type = 'dataframe')
For the first group, I use dmatrices, which comes from patsy, to select the features. The formula syntax is powerful and will look familiar if you know R. Using this reference guide, I’ve written my code to assign Churn to subset y. I then selected Age, Monthly Charge, and the log of Population and assigned them to subset X. I used the log of Population because its range of values is significantly greater than that of Age or Monthly Charge, which can skew the relationship.
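To see why the log helps, here is a small sketch with made-up numbers. The values below are hypothetical and only mimic the rough scale of each column; they are not taken from the churn data set. The helper value_range is also just an illustrative name.

```python
import numpy as np

# Hypothetical values that only mimic the scale of each column
age = np.array([19, 35, 52, 78])
monthly_charge = np.array([25.0, 60.5, 89.9, 118.0])
population = np.array([120, 4_500, 38_000, 105_000])


def value_range(values):
    """Return the gap between the largest and smallest value."""
    return float(np.max(values) - np.min(values))


print("Age range:", value_range(age))
print("MonthlyCharge range:", value_range(monthly_charge))
print("Population range:", value_range(population))
print("log(Population) range:", round(value_range(np.log(population)), 2))
```

One thing to keep in mind: np.log returns -inf for a value of zero, so a Population column containing zeros would need handling first (np.log1p is one option).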
The second and third groups work the same way, except that dmatrices checks whether a feature is categorical and, if it is, automatically one-hot encodes it.
# Second group
y2,X2 = dmatrices('Churn ~ PhoneService + InternetService + PaperlessBilling', data = df, return_type = 'dataframe')
# Third Group
y3,X3 = dmatrices('Churn ~ Offer + PaymentMethod + InternetType ', data = df, return_type = 'dataframe')
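Here is a minimal sketch of what that automatic encoding looks like, using a tiny made-up data frame. The rows and category labels are hypothetical and chosen only for illustration.

```python
import pandas as pd
from patsy import dmatrices

# Hypothetical toy rows; the category labels are illustrative only
toy = pd.DataFrame({
    "Churn": [1, 0, 1, 0, 1, 0],
    "InternetType": ["Cable", "DSL", "Fiber Optic", "Cable", "DSL", "Fiber Optic"],
})

y_toy, X_toy = dmatrices("Churn ~ InternetType", data=toy, return_type="dataframe")

# The design matrix holds the intercept plus one indicator column
# for every category level except the reference level
print(list(X_toy.columns))
print(X_toy)
```

Because the formula includes an intercept, patsy drops one level (here "Cable", the first one alphabetically) and uses it as the reference. The coefficients for the remaining levels are then read relative to that reference.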
Next, I will pass each group to the logistic regression function and fit each model to the data.
# Assign each group to a logistic regression function
mod = sm.Logit(y,X)
mod2 = sm.Logit(y2,X2)
mod3 = sm.Logit(y3,X3)
# Fit the data to each model
res = mod.fit()
res2 = mod2.fit()
res3 = mod3.fit()
# Generate the summary of each model
res.summary()
res2.summary()
res3.summary()
Below is a sample of the first model’s summary. Because I used dmatrices, the formula automatically includes the intercept.
Now that I’ve generated all three of the logistic regression summaries, I will create a new data frame. My new data frame will contain the coefficients and their p values.
# Creating new data frame using the p values of each summary
pvalues = pd.concat([res3.pvalues[1:].T.to_frame(), res2.pvalues[1:].T.to_frame(),res.pvalues[1:].T.to_frame()],
                    ignore_index=False)
# Reviewing the results and creating a new column called Features
pvalues.index.name = 'Features'
pvalues.reset_index()
# Creating new data frame using the coefficients of each summary
params = pd.concat([res3.params[1:].T.to_frame(), res2.params[1:].T.to_frame(),res.params[1:].T.to_frame()],
                   ignore_index=False)
# Reviewing the results and creating a new column called Features
params.index.name = 'Features'
params.reset_index()
The code below shows how I merged the two data frames using a left join on Features. I then filled NaN with zero and reset the index.
# Merging params and pvalues on their Features and filling NaN with zero.
results = pd.merge(params, pvalues, how = "left", on = "Features",suffixes=("params","pvalues")).fillna(0).reset_index()
Afterward, I renamed the columns to be more concise.
# Renaming columns and displaying results
results = results.rename(columns={'0params': 'Params', "0pvalues":'Pvalues'})
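As an aside, the same Features / Params / Pvalues table can be built in one step, because pandas aligns Series on their index. This is only an alternative sketch, not the approach used above, and the two Series below are hypothetical stand-ins for res.params and res.pvalues.

```python
import pandas as pd

# Hypothetical coefficients and p values standing in for res.params / res.pvalues
params_toy = pd.Series({"Intercept": -1.20, "PaperlessBilling[T.Yes]": 0.60, "Age": 0.01})
pvalues_toy = pd.Series({"Intercept": 0.00, "PaperlessBilling[T.Yes]": 0.01, "Age": 0.20})

results_toy = (
    pd.DataFrame({"Params": params_toy, "Pvalues": pvalues_toy})  # aligned on index
    .drop(index="Intercept")   # remove the intercept row by name
    .rename_axis("Features")   # name the index so it becomes a column
    .reset_index()
)
print(results_toy)
```

Naming the columns up front avoids the rename step, and dropping the intercept by name does not depend on it being the first row.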
The final step is to filter the data frame, keeping only features with a p value < 0.05 and a positive coefficient.
# Generating the final data frame with the filters
final = results.loc[(results['Pvalues'] < 0.05) & (results['Params'] >= 0.00)].reset_index(drop=True)
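Next, I create a new column called Odds. It holds the exponentiated coefficient, which is the odds ratio: the factor by which the odds of churning are multiplied when the feature is present (or increases by one unit).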
# Creating a new column called Odds
final['Odds'] = np.exp(final['Params'])
# Creating a new column called Percent using the odds of the feature
final['Percent'] = (final['Odds'] - 1)*100
As the final step, I will sort the data frame by odds and reset the index, dropping the old one.
final.sort_values(by=['Odds'], ascending=False).reset_index(drop=True)
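To make the arithmetic behind the Odds and Percent columns concrete, here is a small sketch using only the standard library. The coefficients are hypothetical and the function names are mine, chosen for illustration.

```python
import math


def odds_ratio(coefficient):
    """Exponentiate a logistic regression coefficient to get the odds ratio."""
    return math.exp(coefficient)


def percent_change_in_odds(coefficient):
    """Express the odds ratio as a percent change in the odds."""
    return (odds_ratio(coefficient) - 1) * 100


# Hypothetical coefficients, not taken from the fitted models
for coef in (0.0, 0.6, 1.0):
    print(
        f"coef={coef:.2f}  odds ratio={odds_ratio(coef):.2f}  "
        f"percent={percent_change_in_odds(coef):.1f}%"
    )
```

A coefficient of zero gives an odds ratio of 1, meaning no change in the odds. That is why only positive coefficients translate into an increase in the odds of churning.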
Breaking down the data
The following is my understanding of what the data is telling me, keeping everything else equal:
- Having internet service increases the odds of a customer churning by 265%
- Accepting “offer E” increases the odds of a customer churning by 227%
- Having Fiber Optic as the internet type increases the odds of a customer churning by 150%
- Paying by mailed check increases the odds of a customer churning by 142%
- Being signed up for paperless billing increases the odds of churning by 83%
This completes the final step in my customer churn analysis.
You could combine features to generate a new feature, which may be more influential. However, for this analysis, we are keeping everything separate and equal when reviewing the odds. You can also add the intercept to find the range of the odds, if you would like.
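One possible reading of that last point: adding the intercept to a coefficient gives the log odds for a customer, which can then be turned into odds and a probability. The sketch below assumes a single binary feature, and the intercept and coefficient are hypothetical numbers rather than values from the models above.

```python
import math


def churn_probability(intercept, coefficient, feature_value):
    """Turn intercept + coefficient * feature into a probability."""
    log_odds = intercept + coefficient * feature_value
    odds = math.exp(log_odds)
    return odds / (1 + odds)


# Hypothetical intercept and coefficient for one binary feature
intercept, coefficient = -1.5, 0.6

without_feature = churn_probability(intercept, coefficient, 0)
with_feature = churn_probability(intercept, coefficient, 1)

print(f"Probability without the feature: {without_feature:.3f}")
print(f"Probability with the feature:    {with_feature:.3f}")
```

The two printed probabilities mark the low and high end for that one feature, with every other feature held at its reference level.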
If you have found any of my Customer Churn articles helpful, please give me a follow or a few claps. I’m continuing to create articles with the goal of helping others and sharing tips and tricks along the way.