HOW TO READ THIS NOTEBOOK:
This notebook has several sections:
Util functions that generate plots, train models and print out statistics
Feel free to skip these unless you’re interested in the details of how results are generated
NOTE: Some of the util functions are taken from open-source code (e.g. printing the ROC curve), others I wrote myself, and some are a mix of both
Execution of functions
The output of the analysis fitted with provided test dataset
My comments
I used Python docstrings to write down my thoughts throughout the analysis and model engineering and to describe results
I used # to comment on some code for readability
I split all processes into four sections: Exploratory Analysis, Data Processing, Model Engineering, Create Prediction.
In [1]:
#install required packages if needed:
#!pip install ipykernel notebook numpy matplotlib pandas sklearn seaborn graphviz
#print versions
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
from sklearn.pipeline import Pipeline
from utils import *
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier , plot_tree
from xgboost import plot_tree as plot_tree_xgb
from sklearn.linear_model import LogisticRegression
import xgboost
from xgboost import XGBClassifier
#disable warnings for viz
import warnings
warnings.filterwarnings('ignore')
print('Python: ',sys.version)
print('pandas: ',pd.__version__)
print('sklearn: ',sklearn.__version__)
print('xgboost: ',xgboost.__version__)
Python: 3.6.13 |Anaconda, Inc.| (default, Feb 23 2021, 21:15:04)
[GCC 7.3.0]
pandas: 1.1.5
sklearn: 0.24.1
xgboost: 1.4.1
In [2]:
#provide path to the input data:
output_file_name = 'predictions_churn_dataset_test.csv'
#train dataset:
path_to_train = 'input/churn_dataset_train.csv'
#random seed
RANDOM_SEED = 42
In [3]:
#1 - EXPLORATORY ANALYSIS
#Read training dataset, print preview:
df_input = pd.read_csv(path_to_train)
print('Input shape:\n\n\n train:', df_input.shape)
Input shape:
train: (3500, 20)
In [4]:
print('Train dataset:')
df_input.head()
Out[4]:
| state | account_length | area_code | international_plan | voice_mail_plan | number_vmail_messages | total_day_minutes | total_day_calls | total_day_charge | total_eve_minutes | total_eve_calls | total_eve_charge | total_night_minutes | total_night_calls | total_night_charge | total_intl_minutes | total_intl_calls | total_intl_charge | number_customer_service_calls | churn |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AR | 173 | area_code_408 | no | no | 0 | 154.6 | 81 | 26.28 | 147.3 | 100.0 | 12.52 | 132.9 | 99 | 5.98 | 6.9 | 5 | 1.86 | 0 | no |
1 | ID | 127 | area_code_408 | no | no | 0 | 102.8 | 128 | 17.48 | 143.7 | 95.0 | 12.21 | 191.4 | 97 | 8.61 | 10.0 | 5 | 2.70 | 1 | no |
2 | TX | 91 | area_code_415 | no | no | 0 | 251.5 | 57 | 42.76 | 179.1 | 113.0 | 15.22 | 163.2 | 72 | 7.34 | 6.6 | 3 | 1.78 | 1 | no |
3 | ND | 60 | area_code_510 | no | no | 0 | 203.2 | 99 | NaN | 235.8 | 131.0 | 20.04 | 224.9 | 112 | 10.12 | 15.1 | 6 | 4.08 | 2 | no |
4 | NV | 83 | area_code_510 | no | yes | 31 | 129.8 | 87 | 22.07 | 183.4 | 110.0 | 15.59 | 169.4 | 40 | 7.62 | 14.3 | 6 | 3.86 | 1 | no |
In [5]:
# We are provided with 19 input features and one target label
# 1.1 - Descriptive analysis
# The first thing I'll examine is the descriptive statistics:
#print statistics for each column:
for col in df_input.columns:
print(df_input[col].describe(),'\n','-'*60)
count 3500
unique 51
top WV
freq 111
Name: state, dtype: object
------------------------------------------------------------
count 3500.000000
mean 100.062286
std 40.126972
min 1.000000
25% 73.000000
50% 100.000000
75% 127.000000
max 232.000000
Name: account_length, dtype: float64
------------------------------------------------------------
count 3500
unique 3
top area_code_415
freq 1721
Name: area_code, dtype: object
------------------------------------------------------------
count 3500
unique 2
top no
freq 3183
Name: international_plan, dtype: object
------------------------------------------------------------
count 3500
unique 2
top no
freq 2579
Name: voice_mail_plan, dtype: object
------------------------------------------------------------
count 3500.000000
mean 7.654857
std 13.430216
min 0.000000
25% 0.000000
50% 0.000000
75% 16.000000
max 52.000000
Name: number_vmail_messages, dtype: float64
------------------------------------------------------------
count 3500.000000
mean 180.133229
std 53.946041
min 0.000000
25% 143.700000
50% 180.150000
75% 216.125000
max 351.500000
Name: total_day_minutes, dtype: float64
------------------------------------------------------------
count 3500.000000
mean 99.821429
std 19.841249
min 0.000000
25% 87.000000
50% 100.000000
75% 113.000000
max 165.000000
Name: total_day_calls, dtype: float64
------------------------------------------------------------
count 3216.000000
mean 30.643193
std 9.147634
min 0.000000
25% 24.480000
50% 30.620000
75% 36.725000
max 59.760000
Name: total_day_charge, dtype: float64
------------------------------------------------------------
count 3500.000000
mean 200.012857
std 50.504487
min 0.000000
25% 165.800000
50% 199.900000
75% 233.900000
max 359.300000
Name: total_eve_minutes, dtype: float64
------------------------------------------------------------
count 3269.000000
mean 100.109208
std 20.023304
min 0.000000
25% 87.000000
50% 100.000000
75% 114.000000
max 170.000000
Name: total_eve_calls, dtype: float64
------------------------------------------------------------
count 3500.000000
mean 17.001297
std 4.292901
min 0.000000
25% 14.090000
50% 16.990000
75% 19.880000
max 30.540000
Name: total_eve_charge, dtype: float64
------------------------------------------------------------
count 3500.000000
mean 199.786514
std 49.987834
min 43.700000
25% 167.000000
50% 199.300000
75% 233.325000
max 395.000000
Name: total_night_minutes, dtype: float64
------------------------------------------------------------
count 3500.000000
mean 99.714286
std 20.025978
min 33.000000
25% 86.000000
50% 99.500000
75% 113.000000
max 175.000000
Name: total_night_calls, dtype: float64
------------------------------------------------------------
count 3500.000000
mean 8.990494
std 2.249460
min 1.970000
25% 7.520000
50% 8.970000
75% 10.500000
max 17.770000
Name: total_night_charge, dtype: float64
------------------------------------------------------------
count 3500.000000
mean 10.284829
std 2.777536
min 0.000000
25% 8.600000
50% 10.400000
75% 12.100000
max 20.000000
Name: total_intl_minutes, dtype: float64
------------------------------------------------------------
count 3500.000000
mean 4.470286
std 2.497207
min 0.000000
25% 3.000000
50% 4.000000
75% 6.000000
max 20.000000
Name: total_intl_calls, dtype: float64
------------------------------------------------------------
count 3500.000000
mean 2.777434
std 0.749906
min 0.000000
25% 2.320000
50% 2.810000
75% 3.270000
max 5.400000
Name: total_intl_charge, dtype: float64
------------------------------------------------------------
count 3500.000000
mean 1.553714
std 1.315906
min 0.000000
25% 1.000000
50% 1.000000
75% 2.000000
max 9.000000
Name: number_customer_service_calls, dtype: float64
------------------------------------------------------------
count 3500
unique 2
top no
freq 3013
Name: churn, dtype: object
------------------------------------------------------------
The following things can be inferred from the descriptive statistics:
state
: 51 unique values. It looks like we have data from all US states (likely the 50 states plus DC).
It’s a good sign since it reduces the risk of selection bias in our data.
Let’s plot the distribution of observations by state:
In [6]:
fig , ax = plt.subplots(figsize=(8,5))
df_input['state'].value_counts().plot.bar(ax=ax,label='obs in state')
ax.axhline(df_input['state'].value_counts().mean(),color='red',label='Mean')
ax.set(title='Figure 1: # of obs by state')
ax.legend();
From Figure 1 we can see that WV, MN, and AL are highly overrepresented in the dataset,
whereas CA and AK are underrepresented.
To rule out selection bias, we should clarify this distribution with the data owners!
In [7]:
churn_total = (df_input.replace({'yes':1,'no':0}).churn).value_counts()
fig , ax = plt.subplots(figsize=(5,5))
churn_total.plot.bar(ax=ax,label='# in class')
ax.set(title='Figure 2.1: Churn rate total')
ax.legend();
We can see that the yes
(churn) class is clearly underrepresented in this dataset (~14%).
This will be a challenge when creating a classification model, since the accuracy score by itself
won’t be that useful.
I assume the recall measure is more suitable to capture the model’s ability to identify churning customers.
I will also try resampling techniques to balance classes in the training data; a minimal sketch follows below.
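A minimal sketch of one resampling option (upsampling the minority class with sklearn's resample). This is only an illustration of the idea; the name df_balanced is hypothetical, and the final pipeline may use class weights or other techniques instead:
from sklearn.utils import resample
#split by class and upsample the minority ('yes') class to the size of the majority class
majority = df_input[df_input['churn'] == 'no']
minority = df_input[df_input['churn'] == 'yes']
minority_upsampled = resample(minority, replace=True,
                              n_samples=len(majority), random_state=RANDOM_SEED)
df_balanced = pd.concat([majority, minority_upsampled])
print(df_balanced['churn'].value_counts())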
In [8]:
#Let's examine the churn rate by state
churn_by_state = (df_input.replace({'yes':1,'no':0})
.groupby('state')['churn']
.apply(lambda x: round(x.sum()/ x.count(),2))
.sort_values()
)
fig , ax = plt.subplots(figsize=(8,5))
churn_by_state.plot.bar(ax=ax,label='churn rate, pct')
ax.axhline(churn_by_state.mean(),color='red',label='Mean')
ax.set(title='Figure 2.2: Churn rate by state')
ax.legend();
Figure 2.2 tells an interesting story: the churn rate seems to be highly unbalanced state by state.
There might be several reasons for this, including underrepresentation of the company in these regions,
high competitive pressure, or other factors. There’s a story behind that, and it’s a good follow-up
question for the business people.
The decision we should make here is whether or not to include the state
variable in the model.
If it is included, the training score will certainly be higher, since geo-location itself can be a good predictor.
BUT with no information about the causes of this distribution, the model might fail to generalize to new data.
area_code
: there are only three unique values. Why is that?
At a glance, this variable doesn’t provide much information
and is a candidate for exclusion.
account_length
: I assume this metric traces how old the account is in general.
A client with a two-digit account length is assumed to be older than one with three or more digits, so the raw numerical representation of this metric might not be the most relevant.
I’ll replace this metric with bins based on the number of digits (a quick illustration follows below).
In all numerical variables the min value is never below zero, which is also a good sign.
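A quick illustration of the proposed binning (a hypothetical one-liner; the actual transformation is implemented inside the DataProcessing class used later):
#number of digits in account_length as a coarse 'age of account' bin
account_length_digits = df_input['account_length'].astype(str).str.len()
print(account_length_digits.value_counts().sort_index())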
In [9]:
# I remove the state variable in this exercise, to avoid geo bias in the model
vars_to_drop = ['area_code'
,'state'
]
In [10]:
# Next - sanity check of the provided dataset,
# using the get_df_stats util function:
#print sanity check
sanity_stats = get_df_stats(df_input)
print(sanity_stats)
Feature Unique_values \
8 total_day_charge 1608
10 total_eve_calls 118
11 total_eve_charge 1472
18 number_customer_service_calls 10
17 total_intl_charge 164
16 total_intl_calls 21
15 total_intl_minutes 164
14 total_night_charge 946
13 total_night_calls 124
12 total_night_minutes 1636
0 state 51
1 account_length 212
9 total_eve_minutes 1652
7 total_day_calls 118
6 total_day_minutes 1670
5 number_vmail_messages 46
4 voice_mail_plan 2
3 international_plan 2
2 area_code 3
19 churn 2
Percentage of missing values Number of missing values \
8 8.114286 284
10 6.600000 231
11 0.000000 0
18 0.000000 0
17 0.000000 0
16 0.000000 0
15 0.000000 0
14 0.000000 0
13 0.000000 0
12 0.000000 0
0 0.000000 0
1 0.000000 0
9 0.000000 0
7 0.000000 0
6 0.000000 0
5 0.000000 0
4 0.000000 0
3 0.000000 0
2 0.000000 0
19 0.000000 0
Percentage of values in the biggest category dtype
8 8.114286 float64
10 6.600000 float64
11 0.314286 float64
18 35.628571 int64
17 1.742857 float64
16 19.371429 int64
15 1.742857 float64
14 0.485714 float64
13 2.314286 int64
12 0.314286 float64
0 3.171429 object
1 1.257143 int64
9 0.228571 float64
7 2.371429 int64
6 0.257143 float64
5 73.714286 int64
4 73.685714 object
3 90.942857 object
2 49.171429 object
19 86.085714 object
The following things can be observed:
- Our target variable churn doesn’t have missing values, which is good.
- We observe a significant number of missing values in the total_day_charge and total_eve_calls variables.
From the definitions:
- total_day_charge (total charge of day calls)
- total_eve_calls (total number of evening calls)
Another thing I noticed is that the Percentage of missing values
== Percentage of values in the biggest category for these two features.
I assume it’s because the missing values form the biggest category; let’s validate:
In [11]:
print('total_day_charge:\n',df_input['total_day_charge'].fillna(-9999).value_counts().iloc[:3])
print('total_eve_calls:\n',df_input['total_eve_calls'].fillna(-9999).value_counts().iloc[:3])
total_day_charge:
-9999.00 284
32.18 8
26.18 7
Name: total_day_charge, dtype: int64
total_eve_calls:
-9999.0 231
105.0 74
108.0 73
Name: total_eve_calls, dtype: int64
In [12]:
print('Total pct of missing value: ',sanity_stats['Number of missing values'].sum() / df_input.shape[0] )
Total pct of missing value: 0.14714285714285713
Missing values affect up to ~15% of rows, which is significant: we cannot just drop these rows, because it would noticeably reduce the training data and hence the performance of the model. Hence, these two features must be treated in one of two ways (a minimal imputation sketch follows below):
- Fill the missing values using some method (mean, median, etc.)
- Remove these two features completely from the model if they turn out to have little importance
If possible, it would also be good to know when the data was collected, to prevent potential data leakage:
we don’t want our model to be trained on future data and validated on the past.
Even though it’s not that critical for this exercise, this type of data leakage can affect a production model dramatically.
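A minimal sketch of the first option (median imputation) using sklearn's SimpleImputer. The real treatment happens inside the DataProcessing pipeline step, so this copy (df_imputed is a hypothetical name) is for illustration only:
from sklearn.impute import SimpleImputer
cols_with_nans = ['total_day_charge', 'total_eve_calls']
imputer = SimpleImputer(strategy='median')
df_imputed = df_input.copy()
df_imputed[cols_with_nans] = imputer.fit_transform(df_imputed[cols_with_nans])
print(df_imputed[cols_with_nans].isna().sum())  #should be 0 for both columns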
In [13]:
# 1.2 - Correlation Analysis
'''
In this dataset, most of the variables are related to some usage patterns:
num. of minutes, num. of calls, etc. I assume these variables could be highly correlated and hence hold redundant information.
'''
#plot correlation matrix for numerical features
num_varialbes = df_input.select_dtypes(include=np.number).columns.tolist()
numeric_features = df_input[num_varialbes]
print('numerical variables: \n', num_varialbes)
fig = plt.figure(figsize=(8, 6))
plt.matshow(numeric_features.corr(), fignum=fig.number)
cb = plt.colorbar()
cb.ax.tick_params(labelsize=14)
plt.title('Figure 3: Correlation Matrix', fontsize=16);
print(numeric_features.corr().abs().applymap(lambda x : x>.85))
numerical variables:
['account_length', 'number_vmail_messages', 'total_day_minutes', 'total_day_calls', 'total_day_charge', 'total_eve_minutes', 'total_eve_calls', 'total_eve_charge', 'total_night_minutes', 'total_night_calls', 'total_night_charge', 'total_intl_minutes', 'total_intl_calls', 'total_intl_charge', 'number_customer_service_calls']
account_length number_vmail_messages \
account_length True False
number_vmail_messages False True
total_day_minutes False False
total_day_calls False False
total_day_charge False False
total_eve_minutes False False
total_eve_calls False False
total_eve_charge False False
total_night_minutes False False
total_night_calls False False
total_night_charge False False
total_intl_minutes False False
total_intl_calls False False
total_intl_charge False False
number_customer_service_calls False False
total_day_minutes total_day_calls \
account_length False False
number_vmail_messages False False
total_day_minutes True False
total_day_calls False True
total_day_charge True False
total_eve_minutes False False
total_eve_calls False False
total_eve_charge False False
total_night_minutes False False
total_night_calls False False
total_night_charge False False
total_intl_minutes False False
total_intl_calls False False
total_intl_charge False False
number_customer_service_calls False False
total_day_charge total_eve_minutes \
account_length False False
number_vmail_messages False False
total_day_minutes True False
total_day_calls False False
total_day_charge True False
total_eve_minutes False True
total_eve_calls False False
total_eve_charge False True
total_night_minutes False False
total_night_calls False False
total_night_charge False False
total_intl_minutes False False
total_intl_calls False False
total_intl_charge False False
number_customer_service_calls False False
total_eve_calls total_eve_charge \
account_length False False
number_vmail_messages False False
total_day_minutes False False
total_day_calls False False
total_day_charge False False
total_eve_minutes False True
total_eve_calls True False
total_eve_charge False True
total_night_minutes False False
total_night_calls False False
total_night_charge False False
total_intl_minutes False False
total_intl_calls False False
total_intl_charge False False
number_customer_service_calls False False
total_night_minutes total_night_calls \
account_length False False
number_vmail_messages False False
total_day_minutes False False
total_day_calls False False
total_day_charge False False
total_eve_minutes False False
total_eve_calls False False
total_eve_charge False False
total_night_minutes True False
total_night_calls False True
total_night_charge True False
total_intl_minutes False False
total_intl_calls False False
total_intl_charge False False
number_customer_service_calls False False
total_night_charge total_intl_minutes \
account_length False False
number_vmail_messages False False
total_day_minutes False False
total_day_calls False False
total_day_charge False False
total_eve_minutes False False
total_eve_calls False False
total_eve_charge False False
total_night_minutes True False
total_night_calls False False
total_night_charge True False
total_intl_minutes False True
total_intl_calls False False
total_intl_charge False True
number_customer_service_calls False False
total_intl_calls total_intl_charge \
account_length False False
number_vmail_messages False False
total_day_minutes False False
total_day_calls False False
total_day_charge False False
total_eve_minutes False False
total_eve_calls False False
total_eve_charge False False
total_night_minutes False False
total_night_calls False False
total_night_charge False False
total_intl_minutes False True
total_intl_calls True False
total_intl_charge False True
number_customer_service_calls False False
number_customer_service_calls
account_length False
number_vmail_messages False
total_day_minutes False
total_day_calls False
total_day_charge False
total_eve_minutes False
total_eve_calls False
total_eve_charge False
total_night_minutes False
total_night_calls False
total_night_charge False
total_intl_minutes False
total_intl_calls False
total_intl_charge False
number_customer_service_calls True
Nice plot, isn’t it? 🙂
Examining these regular correlation spikes, I found that the _minutes
and _charge
variables are highly correlated, and hence one of each pair can be excluded. I’d prefer to keep the _minutes
variables as features and drop the _charge ones.
In [14]:
#extend the list of variables to exclude from the model
vars_to_drop.extend([c for c in numeric_features if '_charge' in c])
In [15]:
def plot_dist(log,title ):
fig = plt.figure(figsize=(8, 7))
ax = fig.gca()
for col in num_varialbes:
if log:
np.log1p(df_input[col]).hist(ax=ax,alpha=.3,label=col)
else:
df_input[col].hist(ax=ax,alpha=.3,label=col)
ax.set(title=title)
ax.legend()
plot_dist(log=False, title='Figure 4: Num. Vars Distribution (Linear)')
#Let's plot the distribution of the log-transformed variables
plot_dist(log=True, title='Figure 5: Num. Vars Distribution (Log)')
We can see that most numerical features are distributed more or less normally
but on different scales. This needs to be handled!
Since the total_*
variables accumulate usage, in production these metrics will grow over time
and push the distributions to the right.
To address this, I’ll use a log(x+1) transform instead of a plain normalization (a one-column sketch follows below).
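A sketch of the log(x+1) idea on a single column (illustration only; DataProcessing applies it to the numerical features when to_log1p=True):
from sklearn.preprocessing import FunctionTransformer
#wrap np.log1p so it can be dropped into a pipeline
log1p_transformer = FunctionTransformer(np.log1p)
logged = log1p_transformer.fit_transform(df_input[['total_day_minutes']])
print(logged[:3])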
In [16]:
# A pair plot for `total_*` variables conditioned on three categorical variables:
cat_variable = ['international_plan'
,'voice_mail_plan'
,'churn']
total_vars = [c for c in df_input.columns if 'total_' in c and c not in vars_to_drop]
# total_vars = total_vars[:4]
sample_size = 100
def plot_pairplot():
for i , col in enumerate(cat_variable[-1:]):
#sample dataframe for performance/plot optimization
df_toplot = (df_input[total_vars+[col]]
.sample(sample_size)
)
#create pair plot
ax = sns.pairplot(df_toplot,hue=col,size=1)
fig = plt.gcf()
fig.suptitle(f'Figure 4.{i}: Pair Plot {col}')
# break
plot_pairplot()
- Interestingly, there is a visible separation between the churn classes along total_day_minutes
- All other plots are distributed pretty much evenly.
In [17]:
#2 - DATA PROCESSING
In [18]:
target_label = 'churn'
#assemble feature name lists
num_varialbes_touse = total_vars + [c for c in df_input.columns if 'number_' in c]
cat_variables_touse = [c for c in cat_variable if target_label not in c ]
#split feature and target variables
df_totransform = df_input.copy()
#create the data pre-precessing object
processor = DataProcessing(num_varialbes_touse,cat_variables_touse
,vars_to_drop=vars_to_drop,to_log1p=True)
X , y = df_totransform.drop(target_label,axis=1) , df_totransform[[target_label]]
#Extract feature names for viz labeling
X_ = processor.transform(X,y)
features_names = processor.features_names
columns_names = X.columns.tolist()
In [19]:
#3 - MODEL ENGINEERING
Before engineering any model, I’ll establish baseline models and evaluation metrics (a toy illustration of the metrics follows below):
Baseline models:
– Most-frequent-class classifier
– Random (uniform) classifier
Evaluation metrics:
1 – Recall score: minimizing the type 2 error (missed churners)
2 – Accuracy score
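A toy illustration of the two metrics on hypothetical labels (not model output), just to make the definitions concrete:
from sklearn.metrics import recall_score, accuracy_score
y_true_toy = [0, 0, 1, 1, 1, 0]   #hypothetical ground truth (1 = churn)
y_pred_toy = [0, 1, 1, 0, 1, 0]   #hypothetical predictions
print('recall:  ', recall_score(y_true_toy, y_pred_toy))    #2 of 3 churners caught -> ~0.67
print('accuracy:', accuracy_score(y_true_toy, y_pred_toy))  #4 of 6 correct -> ~0.67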
In [20]:
score_to_optimize = 'recall_score'
# 3.1 split the input dataset into training/validation and test datasets: create the Data Layer
dl = DataLayer(X,y,None,target_label,random_state=RANDOM_SEED)
def make_dummy_model(dl,strategy):
'''
A wrapper for the dummy classifier model
'''
print('*'*60,
'\nDummy model with strategy: ', strategy)
dummy_clf = Pipeline(steps = [('processing', DataProcessing(num_varialbes_touse,cat_variables_touse
,vars_to_drop=vars_to_drop,to_log1p=True,columns_names=columns_names))
,('dummy_clf', DummyClassifier(strategy=strategy))
])
dummy_clf.fit(dl.X_train,dl.y_train)
y_hat = dummy_clf.predict(dl.X_test)
print_report(dl.y_test,y_hat)
# 3.2 Calculate scores on the baseline (null) models
make_dummy_model(dl,'most_frequent')
make_dummy_model(dl,'uniform')
************************************************************
Dummy model with strategy: most_frequent
Confusion matrix:
pred_neg pred_pos
neg 1005 0
pos 150 0
precision recall f1-score support
0.0 0.87 1.00 0.93 1005
1.0 0.00 0.00 0.00 150
accuracy 0.87 1155
macro avg 0.44 0.50 0.47 1155
weighted avg 0.76 0.87 0.81 1155
************************************************************
Dummy model with strategy: uniform
Confusion matrix:
pred_neg pred_pos
neg 498 507
pos 71 79
precision recall f1-score support
0.0 0.88 0.50 0.63 1005
1.0 0.13 0.53 0.21 150
accuracy 0.50 1155
macro avg 0.51 0.51 0.42 1155
weighted avg 0.78 0.50 0.58 1155
The most_frequent
dummy model tells us what the model accuracy would be if we always predicted
the most frequent class: 87%.
This is our baseline accuracy. We cannot accept an alternative model if it has a lower overall
accuracy.
The uniform
dummy model tells us the baseline recall
score: ~53%.
We cannot accept an alternative model if it has a lower recall
performance than this.
3.3 Create alternative models.
Model selection reasoning:
I chose to test three models:
- Logistic Regression
- Decision Tree Classifier
- XGBoost
Parameter selection justification:
I don’t want the models to overfit the data, so the max depth should not be very high,
and the minimum numbers of samples per split and per leaf should not be too small (at least 10 in the grids below).
In [21]:
'''
#3.3.1
'''
#model 1
#Create the param grid for the model
param_RegClf = {'LogRegClf__C': [0.01,0.1,1,10]
,'LogRegClf__penalty' : ['l1','l2','elasticnet']
}
#define the model pipeline
RegClf = Pipeline(steps = [('Processing', DataProcessing(num_varialbes_touse,cat_variables_touse
,vars_to_drop=vars_to_drop,to_log1p=True,columns_names=columns_names))
,('LogRegClf', LogisticRegression()
)
])
#model 2
param_DT = {'DTClf__criterion': ['gini','entropy']
,'DTClf__max_depth':range(3,6)
,'DTClf__min_samples_split': range(10,15)
,'DTClf__min_samples_leaf':range(10,15)
}
#define the model pipeline
DTClf = Pipeline(steps = [('Processing', DataProcessing(num_varialbes_touse,cat_variables_touse
,vars_to_drop=vars_to_drop,to_log1p=True,columns_names=columns_names))
,('DTClf', DecisionTreeClassifier()
)
])
#model 3
param_XGB = {
'XGBClf__max_depth': range(3,6),
'XGBClf__min_child_weight':range(2,6),
'XGBClf__subsample':[1],
'XGBClf__colsample_bytree':[1],
'XGBClf__eta':[.3],
'XGBClf__objective':['binary:logistic'],
'XGBClf__n_jobs':[-1],
'XGBClf__verbosity': [1],
'XGBClf__eval_metrics': ["auc"] }  #NB: XGBoost expects 'eval_metric' (singular); the warning in the output below shows this key was ignored
#define the model pipeline
XGBClf = Pipeline(steps = [('Processing', DataProcessing(num_varialbes_touse,cat_variables_touse
,vars_to_drop=vars_to_drop,to_log1p=True,columns_names=columns_names))
,('XGBClf', XGBClassifier()
)
])
#Train Logistic Regression
best_model_LR = run_grid_search(dl,RegClf,param_RegClf,score_to_optimize,model_name='RegClf')
#Train Decision Tree
best_model_DT = run_grid_search(dl,DTClf,param_DT,score_to_optimize,model_name='DTClf')
#Train XGBoost
best_model_XGB = run_grid_search(dl,XGBClf,param_XGB,score_to_optimize,model_name='XGBClf')
************************************************************
Starting training model: RegClf
Best params for recall_score : {'LogRegClf__C': 0.01, 'LogRegClf__penalty': 'l2'}
Grid Search best scores: 0.7191226230581026
Time took (sec.): 3
Confusion matrix:
pred_neg pred_pos
neg 663 342
pos 38 112
precision recall f1-score support
0.0 0.95 0.66 0.78 1005
1.0 0.25 0.75 0.37 150
accuracy 0.67 1155
macro avg 0.60 0.70 0.57 1155
weighted avg 0.86 0.67 0.72 1155
************************************************************
Starting training model: DTClf
Best params for recall_score : {'DTClf__criterion': 'entropy', 'DTClf__max_depth': 5, 'DTClf__min_samples_leaf': 14, 'DTClf__min_samples_split': 10}
Grid Search best scores: 0.8521049461214109
Time took (sec.): 26
Confusion matrix:
pred_neg pred_pos
neg 948 57
pos 35 115
precision recall f1-score support
0.0 0.96 0.94 0.95 1005
1.0 0.67 0.77 0.71 150
accuracy 0.92 1155
macro avg 0.82 0.85 0.83 1155
weighted avg 0.93 0.92 0.92 1155
************************************************************
Starting training model: XGBClf
[19:36:16] WARNING: ../src/learner.cc:573:
Parameters: { "eval_metrics" } might not be used.
This may not be accurate due to some parameters are only used in language bindings but
passed down to XGBoost core. Or some parameters are not used but slip through this
verification. Please open an issue if you find above cases.
[19:36:16] WARNING: ../src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
Best params for recall_score : {'XGBClf__colsample_bytree': 1, 'XGBClf__eta': 0.3, 'XGBClf__eval_metrics': 'auc', 'XGBClf__max_depth': 5, 'XGBClf__min_child_weight': 2, 'XGBClf__n_jobs': -1, 'XGBClf__objective': 'binary:logistic', 'XGBClf__subsample': 1, 'XGBClf__verbosity': 1}
Grid Search best scores: 0.9965151819378445
Time took (sec.): 131
Confusion matrix:
pred_neg pred_pos
neg 983 22
pos 38 112
precision recall f1-score support
0.0 0.96 0.98 0.97 1005
1.0 0.84 0.75 0.79 150
accuracy 0.95 1155
macro avg 0.90 0.86 0.88 1155
weighted avg 0.95 0.95 0.95 1155
3.3.2 Selecting the Best Model Architecture
Evaluation of the models’ results:
- XGB has the highest CV score of all the models.
- All models passed the baseline on the recall
metric.
- Logistic Regression has a lower accuracy score than the most_frequent
null model,
hence it won’t be selected for further analysis.
Logistic Regression scores:
Recall = 75% (with overall accuracy 67%)
Decision Tree scores:
Recall = 77% (with overall accuracy 92%),
which is 24 percentage points higher than the uniform null model.
XGB scores:
Recall = 75% (with overall accuracy 95%),
which is 22 percentage points higher than the uniform null model.
Looking at recall only, the Decision Tree would be
the favorite, BUT it does much worse on the precision
score (67% vs 84% for XGB; a quick check follows below).
Even though the type 2 error has the higher cost (missing a customer who is about to churn),
misclassifying a significant number of users as about to churn
might create a wrong
perspective about the overall user experience and user satisfaction for the business.
Hence, I’m choosing the XGB model as the one to use.
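A quick arithmetic check of that precision comparison, computed directly from the confusion matrices above (precision = TP / (TP + FP)):
dt_precision  = 115 / (115 + 57)   #Decision Tree: ~0.67
xgb_precision = 112 / (112 + 22)   #XGBoost:       ~0.84
print(round(dt_precision, 2), round(xgb_precision, 2))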
In [22]:
best_model = best_model_XGB
y_scores = best_model.predict_proba(dl.X_test)[:, 1]
# importance = best_model['LogRegClf'].coef_[0]
importance = best_model['XGBClf'].feature_importances_
In [23]:
#3.3.3 Plot a detailed report for the best model
plot_reports(dl.y_test,y_scores,score_to_optimize,initial_threshold=0.5,title_prefix='Figure 6.')
0.9102023217247098
Figure interpretation:
- Figure 6.1
shows the precision-recall tradeoff and the current threshold position.
We see that further optimizing for recall won’t improve recall
much but would
dramatically reduce the precision
score.
- Figure 6.2
shows both precision
and recall
scores as a function of the threshold.
It complements Figure 6.1 in showing that the threshold should not be adjusted further.
- Figure 6.3
is the ROC curve, which can be interpreted as “how much better our classifier does than a random guess.”
We see that the ROC curve is very good!
Actually, it might be too good to be true: I should not exclude the possibility
that the model is overfitting due to some sort of data leakage (a cross-validation sanity check is sketched below).
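One way to probe a too-good-to-be-true result is to re-score the selected pipeline with stratified cross-validation on the full training data. The sketch below assumes best_model is the fitted pipeline returned by run_grid_search and that the churn labels can be mapped to 0/1 as shown:
from sklearn.model_selection import cross_val_score, StratifiedKFold
y_bin = y[target_label].map({'no': 0, 'yes': 1})
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_SEED)
cv_recall = cross_val_score(best_model, X, y_bin, cv=cv, scoring='recall')
print('CV recall per fold:', cv_recall.round(3))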
In [24]:
#3.3.4 Plot features importance
plot_feature_importance(importance,processor.features_names,'Figure 7: XGBClf Feature Importance')
Feature: international_plan, Score: 0.2608954906463623
Feature: voice_mail_plan, Score: 0.12016084045171738
Feature: number_vmail_messages, Score: 0.04120601713657379
Feature: total_day_minutes, Score: 0.0951744019985199
Feature: total_day_calls, Score: 0.023855485022068024
Feature: total_eve_minutes, Score: 0.042013008147478104
Feature: total_eve_calls, Score: 0.030472613871097565
Feature: total_night_minutes, Score: 0.02721640281379223
Feature: total_night_calls, Score: 0.019038571044802666
Feature: total_intl_minutes, Score: 0.06032373756170273
Feature: total_intl_calls, Score: 0.05165725201368332
Feature: number_customer_service_calls, Score: 0.20180296897888184
Feature: account_length_bin, Score: 0.026183191686868668
Feature importance interpretation:
The top four most important features in the model are:
international_plan
number_customer_service_calls
voice_mail_plan
total_day_minutes
The pattern is actually so strong that it’s visible in the pair plot! (Section 1.2, Figure 4.0)
I measured feature importance via the feature_importances_
attribute; depending on the configured importance_type, it reflects how often a feature is used for splits and/or how much those splits improve the objective (a booster-level breakdown is sketched below).
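For reference, the underlying booster can report several importance definitions ('weight' = number of splits using a feature, 'gain' = average improvement from those splits). A small sketch, assuming the fitted pipeline exposes the XGBClassifier step as above:
#keys are the booster's internal feature names (f0, f1, ...) unless names were set explicitly
booster = best_model['XGBClf'].get_booster()
for imp_type in ('weight', 'gain'):
    scores = booster.get_score(importance_type=imp_type)
    print(imp_type, sorted(scores.items(), key=lambda kv: -kv[1])[:4])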
In [25]:
#Plot Tree
fig, ax = plt.subplots(figsize=(8,8))
# plot_tree(best_model['DTClf']
# ,rounded = True
# ,filled = True
# ,feature_names=processor.features_names
# ,class_names=True
# ,fontsize=5
# ,ax=ax
# )
plot_tree_xgb(best_model['XGBClf'],ax=ax)  #note: xgboost's plot_tree has no feature-name argument; names come from the fitted booster or an fmap file
ax.set(title = 'Figure 8: XGBoost Tree Plot');
Potential next steps:
- Try out an ensemble of different models and stronger regularisation (a voting-ensemble sketch is shown below).
- Investigate the potential data leakage / overfitting suspected in the selected model.
Since this exercise has time constraints, I didn’t test the following hypotheses:
- Investigate a geo-aware model and compare results.
The current model doesn’t have state
as a parameter, and I did that on purpose, since an imbalance was found in the churn distribution by state (see Figure 2.2). It might be a temporal effect due to some business factors, so I don’t want the model to judge based on where the user is from. The core assumption here
is that the churn
event should be related to behavioral factors.
But this hypothesis needs to be tested by comparing the performance of both geo-aware
and geo-unaware
models, as well as
by consulting with business users.
- Even though the Logistic Regression model didn’t pass the null threshold, it might be fine-tuned for a better score. Then the two models could be used
in an ensemble to create a voted prediction.
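A sketch of the ensemble idea using sklearn's VotingClassifier with the two tuned pipelines (assuming best_model_LR and best_model_XGB are the pipelines returned by the grid searches above; soft voting averages their predicted probabilities):
from sklearn.ensemble import VotingClassifier
ensemble = VotingClassifier(estimators=[('logreg', best_model_LR),
                                        ('xgb', best_model_XGB)],
                            voting='soft')
ensemble.fit(dl.X_train, dl.y_train)
print_report(dl.y_test, ensemble.predict(dl.X_test))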