Telco
Table of Contents
1. Init
Let's start by importing the necessary modules.
import pandas as pd import numpy as np import seaborn as sns import matplotlib import matplotlib.pyplot as plt matplotlib.use('Agg') # non-interactive backend, produce pngs instead pd.set_option('display.max_columns', None) pd.set_option('display.max_colwidth', None) # uncomment the following line to prevent truncated output # pd.set_option('display.large_repr', 'info') from context import src from src import utils, plotter
2. Analysis
In this section we analyse the telco
dataset. We start by reading
the accompanying data docs which is scarce.
2.1. Preliminary analysis
We start by loading the dataset and answering out initial set of questions.
telco = pd.read_csv('../data/data/telco.csv') telco.head()
customerID gender SeniorCitizen Partner Dependents tenure PhoneService \ 0 7590-VHVEG Female 0 Yes No 1 No 1 5575-GNVDE Male 0 No No 34 Yes 2 3668-QPYBK Male 0 No No 2 Yes 3 7795-CFOCW Male 0 No No 45 No 4 9237-HQITU Female 0 No No 2 Yes MultipleLines InternetService OnlineSecurity OnlineBackup \ 0 No phone service DSL No Yes 1 No DSL Yes No 2 No DSL Yes Yes 3 No phone service DSL Yes No 4 No Fiber optic No No DeviceProtection TechSupport StreamingTV StreamingMovies Contract \ 0 No No No No Month-to-month 1 Yes No No No One year 2 No No No No Month-to-month 3 Yes Yes No No One year 4 No No No No Month-to-month PaperlessBilling PaymentMethod MonthlyCharges TotalCharges \ 0 Yes Electronic check 29.85 29.85 1 No Mailed check 56.95 1889.5 2 Yes Mailed check 53.85 108.15 3 No Bank transfer (automatic) 42.30 1840.75 4 Yes Electronic check 70.70 151.65 Churn 0 No 1 No 2 Yes 3 No 4 Yes
telco.shape
7043 | 21 |
telco.dtypes
customerID object gender object SeniorCitizen int64 Partner object Dependents object tenure int64 PhoneService object MultipleLines object InternetService object OnlineSecurity object OnlineBackup object DeviceProtection object TechSupport object StreamingTV object StreamingMovies object Contract object PaperlessBilling object PaymentMethod object MonthlyCharges float64 TotalCharges object Churn object dtype: object
We have several categorical features which should be represented as
category
dtype. The customerID
column can be dropped since it
doesn't add any new information. TotalCharges
should be converted to
float
dtype. Let's also convert SeniorCitizen
to yes or no
format inline with other categorical features.
2.1.1. Handling customerID
telco = telco.drop('customerID', axis='columns') telco.shape
7043 | 20 |
2.1.2. Handling categorical features
telco.loc[telco['SeniorCitizen'].eq(0), 'SeniorCitizen'] = 'No' telco.loc[telco['SeniorCitizen'].eq(1), 'SeniorCitizen'] = 'Yes' telco['SeniorCitizen'] = telco['SeniorCitizen'].astype('category')
categorical_features = ['gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod', 'Churn'] for feature in categorical_features: telco[feature] = telco[feature].str.strip().astype('category')
2.1.3. Handling TotalCharges
telco['TotalCharges'] = telco['TotalCharges'].str.strip() telco.loc[telco['TotalCharges'].apply(len).eq(0), 'TotalCharges'] = np.nan telco['TotalCharges'] = telco['TotalCharges'].astype('float')
telco.dtypes
gender category SeniorCitizen category Partner category Dependents category tenure int64 PhoneService category MultipleLines category InternetService category OnlineSecurity category OnlineBackup category DeviceProtection category TechSupport category StreamingTV category StreamingMovies category Contract category PaperlessBilling category PaymentMethod category MonthlyCharges float64 TotalCharges float64 Churn category dtype: object
2.1.4. Descriptive statistics, missing & duplicates
Let's look at the descriptive statistics, missing & duplicates next.
telco.describe(include='all')
gender SeniorCitizen Partner Dependents tenure PhoneService \ count 7043 7043 7043 7043 7043.000000 7043 unique 2 2 2 2 NaN 2 top Male No No No NaN Yes freq 3555 5901 3641 4933 NaN 6361 mean NaN NaN NaN NaN 32.371149 NaN std NaN NaN NaN NaN 24.559481 NaN min NaN NaN NaN NaN 0.000000 NaN 25% NaN NaN NaN NaN 9.000000 NaN 50% NaN NaN NaN NaN 29.000000 NaN 75% NaN NaN NaN NaN 55.000000 NaN max NaN NaN NaN NaN 72.000000 NaN MultipleLines InternetService OnlineSecurity OnlineBackup \ count 7043 7043 7043 7043 unique 3 3 3 3 top No Fiber optic No No freq 3390 3096 3498 3088 mean NaN NaN NaN NaN std NaN NaN NaN NaN min NaN NaN NaN NaN 25% NaN NaN NaN NaN 50% NaN NaN NaN NaN 75% NaN NaN NaN NaN max NaN NaN NaN NaN DeviceProtection TechSupport StreamingTV StreamingMovies \ count 7043 7043 7043 7043 unique 3 3 3 3 top No No No No freq 3095 3473 2810 2785 mean NaN NaN NaN NaN std NaN NaN NaN NaN min NaN NaN NaN NaN 25% NaN NaN NaN NaN 50% NaN NaN NaN NaN 75% NaN NaN NaN NaN max NaN NaN NaN NaN Contract PaperlessBilling PaymentMethod MonthlyCharges \ count 7043 7043 7043 7043.000000 unique 3 2 4 NaN top Month-to-month Yes Electronic check NaN freq 3875 4171 2365 NaN mean NaN NaN NaN 64.761692 std NaN NaN NaN 30.090047 min NaN NaN NaN 18.250000 25% NaN NaN NaN 35.500000 50% NaN NaN NaN 70.350000 75% NaN NaN NaN 89.850000 max NaN NaN NaN 118.750000 TotalCharges Churn count 7032.000000 7043 unique NaN 2 top NaN No freq NaN 5174 mean 2283.300441 NaN std 2266.771362 NaN min 18.800000 NaN 25% 401.450000 NaN 50% 1397.475000 NaN 75% 3794.737500 NaN max 8684.800000 NaN
telco.isna().any()
gender False SeniorCitizen False Partner False Dependents False tenure False PhoneService False MultipleLines False InternetService False OnlineSecurity False OnlineBackup False DeviceProtection False TechSupport False StreamingTV False StreamingMovies False Contract False PaperlessBilling False PaymentMethod False MonthlyCharges False TotalCharges True Churn False dtype: bool
Let's drop the missing values we introduced in TotalCharges
.
telco = telco.dropna() telco.shape
7032 | 20 |
telco[telco.duplicated()].shape
22 | 20 |
Let's investigate the duplicates.
telco[telco.duplicated(keep=False)]
gender SeniorCitizen Partner Dependents tenure PhoneService \ 22 Male No No No 1 Yes 100 Male No No No 1 Yes 542 Female No No No 1 Yes 646 Male No No No 1 Yes 662 Male No No No 1 Yes 690 Male No No No 1 Yes 964 Male No No No 1 Yes 976 Male No No No 1 Yes 1243 Male No No No 1 Yes 1338 Male No No No 1 Yes 1491 Female No No No 1 Yes 1731 Female Yes No No 1 Yes 1739 Male No No No 1 Yes 1932 Male No No No 1 Yes 2033 Male No No No 1 Yes 2121 Male No No No 1 Yes 2713 Male No No No 1 Yes 2892 Male No No No 1 Yes 3301 Female Yes No No 1 Yes 3312 Male No No No 1 Yes 3499 Female No No No 1 Yes 3679 Male No No No 1 Yes 3754 Male No No No 1 Yes 4098 Male No No No 1 Yes 4476 Female No No No 1 Yes 4495 Female No No No 1 Yes 4536 Male No No No 1 Yes 4817 Female No No No 1 Yes 5170 Female No No No 1 Yes 5506 Male No No No 1 Yes 5522 Female No No No 1 Yes 5736 Male No No No 1 Yes 5759 Female No No No 1 Yes 6267 Female No No No 1 Yes 6491 Female No No No 1 Yes 6499 Male No No No 1 Yes 6518 Male No No No 1 Yes 6609 Male No No No 1 Yes 6706 Female No No No 1 Yes 6764 Female No No No 1 Yes 6774 Female No No No 1 Yes 6924 Male No No No 1 Yes MultipleLines InternetService OnlineSecurity OnlineBackup \ 22 No No No internet service No internet service 100 No No No internet service No internet service 542 No No No internet service No internet service 646 No DSL No No 662 No No No internet service No internet service 690 No No No internet service No internet service 964 No DSL No No 976 No Fiber optic No No 1243 No DSL No No 1338 No No No internet service No internet service 1491 No No No internet service No internet service 1731 No Fiber optic No No 1739 No Fiber optic No No 1932 No No No internet service No internet service 2033 No No No internet service No internet service 2121 No No No internet service No internet service 2713 No No No internet service No internet service 2892 No No No internet service No internet service 3301 No Fiber optic No No 3312 No No No internet service No internet service 3499 No No No internet service No internet service 3679 No No No internet service No internet service 3754 No No No internet service No internet service 4098 No No No internet service No internet service 4476 No No No internet service No internet service 4495 No Fiber optic No No 4536 No Fiber optic No No 4817 No No No internet service No internet service 5170 No No No internet service No internet service 5506 No No No internet service No internet service 5522 No Fiber optic No No 5736 No No No internet service No internet service 5759 No Fiber optic No No 6267 No Fiber optic No No 6491 No Fiber optic No No 6499 No No No internet service No internet service 6518 No DSL No No 6609 No No No internet service No internet service 6706 No No No internet service No internet service 6764 No Fiber optic No No 6774 No No No internet service No internet service 6924 No Fiber optic No No DeviceProtection TechSupport StreamingTV \ 22 No internet service No internet service No internet service 100 No internet service No internet service No internet service 542 No internet service No internet service No internet service 646 No No No 662 No internet service No internet service No internet service 690 No internet service No internet service No internet service 964 No No No 976 No No No 1243 No No No 1338 No internet service No internet service No internet service 1491 No internet service No internet service No internet service 1731 No No No 1739 No No No 1932 No internet service No internet service No internet service 2033 No internet service No internet service No internet service 2121 No internet service No internet service No internet service 2713 No internet service No internet service No internet service 2892 No internet service No internet service No internet service 3301 No No No 3312 No internet service No internet service No internet service 3499 No internet service No internet service No internet service 3679 No internet service No internet service No internet service 3754 No internet service No internet service No internet service 4098 No internet service No internet service No internet service 4476 No internet service No internet service No internet service 4495 No No No 4536 No No No 4817 No internet service No internet service No internet service 5170 No internet service No internet service No internet service 5506 No internet service No internet service No internet service 5522 No No No 5736 No internet service No internet service No internet service 5759 No No No 6267 No No No 6491 No No No 6499 No internet service No internet service No internet service 6518 No No No 6609 No internet service No internet service No internet service 6706 No internet service No internet service No internet service 6764 No No No 6774 No internet service No internet service No internet service 6924 No No No StreamingMovies Contract PaperlessBilling PaymentMethod \ 22 No internet service Month-to-month No Mailed check 100 No internet service Month-to-month No Mailed check 542 No internet service Month-to-month No Mailed check 646 No Month-to-month Yes Mailed check 662 No internet service Month-to-month No Mailed check 690 No internet service Month-to-month Yes Mailed check 964 No Month-to-month Yes Mailed check 976 No Month-to-month Yes Electronic check 1243 No Month-to-month No Electronic check 1338 No internet service Month-to-month No Mailed check 1491 No internet service Month-to-month No Mailed check 1731 No Month-to-month Yes Electronic check 1739 No Month-to-month Yes Electronic check 1932 No internet service Month-to-month No Mailed check 2033 No internet service Month-to-month Yes Mailed check 2121 No internet service Month-to-month No Mailed check 2713 No internet service Month-to-month Yes Mailed check 2892 No internet service Month-to-month No Mailed check 3301 No Month-to-month Yes Electronic check 3312 No internet service Month-to-month No Mailed check 3499 No internet service Month-to-month No Mailed check 3679 No internet service Month-to-month Yes Mailed check 3754 No internet service Month-to-month No Mailed check 4098 No internet service Month-to-month Yes Mailed check 4476 No internet service Month-to-month No Mailed check 4495 No Month-to-month Yes Electronic check 4536 No Month-to-month Yes Electronic check 4817 No internet service Month-to-month No Mailed check 5170 No internet service Month-to-month No Mailed check 5506 No internet service Month-to-month No Mailed check 5522 No Month-to-month Yes Mailed check 5736 No internet service Month-to-month No Mailed check 5759 No Month-to-month Yes Mailed check 6267 No Month-to-month Yes Electronic check 6491 No Month-to-month Yes Electronic check 6499 No internet service Month-to-month No Mailed check 6518 No Month-to-month No Electronic check 6609 No internet service Month-to-month Yes Mailed check 6706 No internet service Month-to-month No Mailed check 6764 No Month-to-month Yes Electronic check 6774 No internet service Month-to-month No Mailed check 6924 No Month-to-month Yes Electronic check MonthlyCharges TotalCharges Churn 22 20.15 20.15 Yes 100 20.20 20.20 No 542 19.55 19.55 No 646 45.70 45.70 Yes 662 20.05 20.05 No 690 20.45 20.45 No 964 45.70 45.70 Yes 976 69.90 69.90 Yes 1243 45.30 45.30 Yes 1338 20.15 20.15 Yes 1491 19.55 19.55 No 1731 69.60 69.60 Yes 1739 69.90 69.90 Yes 1932 20.20 20.20 No 2033 20.20 20.20 Yes 2121 20.45 20.45 No 2713 20.45 20.45 No 2892 20.45 20.45 No 3301 69.60 69.60 Yes 3312 20.30 20.30 No 3499 20.90 20.90 Yes 3679 20.10 20.10 Yes 3754 20.05 20.05 No 4098 20.20 20.20 Yes 4476 20.90 20.90 Yes 4495 70.10 70.10 Yes 4536 69.35 69.35 Yes 4817 19.90 19.90 No 5170 19.65 19.65 No 5506 20.20 20.20 No 5522 70.15 70.15 Yes 5736 20.05 20.05 No 5759 70.15 70.15 Yes 6267 70.10 70.10 Yes 6491 69.20 69.20 Yes 6499 20.30 20.30 No 6518 45.30 45.30 Yes 6609 20.10 20.10 Yes 6706 19.90 19.90 No 6764 69.20 69.20 Yes 6774 19.65 19.65 No 6924 69.35 69.35 Yes
We see that the churn
feature varies for the duplicate entries.
Since we have so many categorical features, it is likely that the same
values occur for several customers. The same value for MonthlyCharges
& TotalCharges
is also possible since they can be on a fixed cost
plan/subscription. We don't have to drop the duplicates for this
dataset.
2.1.5. Correlations
name = 'heatmap@telco--corr.png' corr = telco.corr() plotter.corr(corr, name) name
tenure & MonthlyCharges
are positively correlated with
TotalCharges
which is expected.