import pandas as pd
from pandas import option_context
import dask.dataframe as dd
import os
Data Exploration¶
1. Background¶
1.1 | What is asked of us¶
Notebook Overview
- How to investigate whether price sensitivity is the most influential factor for a customer churning
- Conduct exploratory data analysis
Price sensitivity
is the degree to which demand changes when the cost of a product or service changes
- In the context of PowerCo, the “demand” refers to the demand for energy consumption
- Price sensitivity is commonly measured using the price elasticity of demand, reflecting that some consumers won't pay more if a lower-priced option is available
What is price elasticity of demand?
- Price elasticity of demand is a measurement of the change in consumption of a product in relation to a change in its price
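The definition above can be sketched in a few lines; the numbers here are purely hypothetical and only illustrate the ratio:

```python
def price_elasticity(q0, q1, p0, p1):
    """Price elasticity of demand: % change in quantity demanded
    divided by % change in price (simple, non-midpoint form)."""
    pct_quantity = (q1 - q0) / q0
    pct_price = (p1 - p0) / p0
    return pct_quantity / pct_price

# hypothetical example: price rises 0.10 -> 0.12 (+20%) while
# demand falls 1000 -> 900 units (-10%): elasticity ≈ -0.5 (inelastic)
e = price_elasticity(1000, 900, 0.10, 0.12)
```

An absolute elasticity below 1 means demand changes proportionally less than price, i.e. customers are relatively insensitive to that price move.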
1.2 | What the client is handing over¶
Our client has sent over 3 data sets (shown below):
- Historical customer: customer data such as usage, sign-up date, forecasted usage etc.
- Historical pricing: variable and fixed pricing data etc.
- Churn indicator: whether each customer has churned or not
You need to analyze the following things:
- The data types of each column
- Descriptive statistics of the dataset
- Distributions of columns
1.3 | File Previews¶
# !head -5 price_data.csv
!head -5 client_data.csv
id,channel_sales,cons_12m,cons_gas_12m,cons_last_month,date_activ,date_end,date_modif_prod,date_renewal,forecast_cons_12m,forecast_cons_year,forecast_discount_energy,forecast_meter_rent_12m,forecast_price_energy_off_peak,forecast_price_energy_peak,forecast_price_pow_off_peak,has_gas,imp_cons,margin_gross_pow_ele,margin_net_pow_ele,nb_prod_act,net_margin,num_years_antig,origin_up,pow_max,churn
24011ae4ebbe3035111d65fa7c15bc57,foosdfpfkusacimwkcsosbicdxkicaua,0,54946,0,2013-06-15,2016-06-15,2015-11-01,2015-06-23,0.0,0,0.0,1.78,0.114481,0.098142,40.606701,t,0.0,25.44,25.44,2,678.99,3,lxidpiddsbxsbosboudacockeimpuepw,43.648,1
d29c2c54acc38ff3c0614d0a653813dd,MISSING,4660,0,0,2009-08-21,2016-08-30,2009-08-21,2015-08-31,189.95,0,0.0,16.27,0.1457109999999999,0.0,44.31137796,f,0.0,16.38,16.38,1,18.89,6,kamkkxfxxuwbdslkwifmmcsiusiuosws,13.8,0
764c75f661154dac3a6c254cd082ea7d,foosdfpfkusacimwkcsosbicdxkicaua,544,0,0,2010-04-16,2016-04-16,2010-04-16,2015-04-17,47.96,0,0.0,38.72,0.165794,0.087899,44.31137796,f,0.0,28.6,28.6,1,6.6,6,kamkkxfxxuwbdslkwifmmcsiusiuosws,13.856,0
bba03439a292a1e166f80264c16191cb,lmkebamcaaclubfxadlmueccxoimlema,1584,0,0,2010-03-30,2016-03-30,2010-03-30,2015-03-31,240.04,0,0.0,19.83,0.146694,0.0,44.31137796,f,0.0,30.22,30.22,1,25.46,6,kamkkxfxxuwbdslkwifmmcsiusiuosws,13.2,0
!ls . -l -h
total 23M
-rw-rw-r-- 1 andrey andrey 3.4M Feb  8 17:14  client_data.csv
-rw-rw-r-- 1 andrey andrey  18M Feb  8 17:14  price_data.csv
-rw-rw-r-- 1 andrey andrey  70K Feb  8 18:23 'Task 2 - eda_starter.ipynb'
-rw-rw-r-- 1 andrey andrey 285K Feb 10 00:47  Task2-edit.ipynb
-rw-rw-r-- 1 andrey andrey 287K Feb 16 20:58  Task2.ipynb
-rw-rw-r-- 1 andrey andrey 587K Feb  8 21:36 'Task 2 - Model Answer - EDA.ipynb'
# the CSV files live in the current working directory
clients_path = os.path.join('','client_data.csv')
prices_path = os.path.join('','price_data.csv')
2. Dataset Preview¶
2.1 | Clients Dataset¶
The basic problem setup is a customer churn one: we have information about each unique customer of PowerCo
- activity_new: Category of the company’s activity
- channel_sales: Code of the sales channel
- cons_12m: Electricity consumption of the past 12 months
- cons_gas_12m: Gas consumption of the past 12 months
- cons_last_month: Electricity consumption of the last month
- date_activ: Date of activation of the contract
- date_end: Registered date of the end of the contract
- date_modif_prod: Date of the last modification of the product
- date_renewal: Date of the next contract renewal
- forecast_cons_12m: Forecasted electricity consumption for the next 12 months
- forecast_cons_year: Forecasted electricity consumption for the next calendar year
- forecast_discount_energy: Forecasted value of current discount
- forecast_meter_rent_12m: Forecasted bill of meter rental for the next 12 months
- forecast_price_energy_off_peak: Forecasted energy price for 1st period (off peak)
- forecast_price_energy_peak: Forecasted energy price for 2nd period (peak)
- forecast_price_pow_off_peak: Forecasted power price for 1st period (off peak)
- has_gas: Indicates whether the client is also a gas client
- imp_cons: Current paid consumption
- margin_gross_pow_ele: Gross margin on power subscription
- margin_net_pow_ele: Net margin on power subscription
- nb_prod_act: Number of active products and services
- net_margin: Total net margin
- num_years_antig: Antiquity of the client (in years)
- origin_up: Code of the electricity campaign the customer first subscribed to
- pow_max: Subscribed power
- churn: Has the client churned over the next 3 months
# each row corresponds to a unique customer
client = dd.read_csv(clients_path,sep=',', blocksize="30M")
The column data types are summarised below
client.dtypes
id                                string[pyarrow]
channel_sales                     string[pyarrow]
cons_12m                                    int64
cons_gas_12m                                int64
cons_last_month                             int64
date_activ                        string[pyarrow]
date_end                          string[pyarrow]
date_modif_prod                   string[pyarrow]
date_renewal                      string[pyarrow]
forecast_cons_12m                         float64
forecast_cons_year                          int64
forecast_discount_energy                  float64
forecast_meter_rent_12m                   float64
forecast_price_energy_off_peak            float64
forecast_price_energy_peak                float64
forecast_price_pow_off_peak               float64
has_gas                           string[pyarrow]
imp_cons                                  float64
margin_gross_pow_ele                      float64
margin_net_pow_ele                        float64
nb_prod_act                                 int64
net_margin                                float64
num_years_antig                             int64
origin_up                         string[pyarrow]
pow_max                                   float64
churn                                       int64
dtype: object
We have a number of categorical string features which we can use to investigate why clients are churning
# show all columns of the dataframe
with option_context('display.max_columns', None):
    display(client.head(5))
id | channel_sales | cons_12m | cons_gas_12m | cons_last_month | date_activ | date_end | date_modif_prod | date_renewal | forecast_cons_12m | forecast_cons_year | forecast_discount_energy | forecast_meter_rent_12m | forecast_price_energy_off_peak | forecast_price_energy_peak | forecast_price_pow_off_peak | has_gas | imp_cons | margin_gross_pow_ele | margin_net_pow_ele | nb_prod_act | net_margin | num_years_antig | origin_up | pow_max | churn | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 24011ae4ebbe3035111d65fa7c15bc57 | foosdfpfkusacimwkcsosbicdxkicaua | 0 | 54946 | 0 | 2013-06-15 | 2016-06-15 | 2015-11-01 | 2015-06-23 | 0.00 | 0 | 0.0 | 1.78 | 0.114481 | 0.098142 | 40.606701 | t | 0.00 | 25.44 | 25.44 | 2 | 678.99 | 3 | lxidpiddsbxsbosboudacockeimpuepw | 43.648 | 1 |
1 | d29c2c54acc38ff3c0614d0a653813dd | MISSING | 4660 | 0 | 0 | 2009-08-21 | 2016-08-30 | 2009-08-21 | 2015-08-31 | 189.95 | 0 | 0.0 | 16.27 | 0.145711 | 0.000000 | 44.311378 | f | 0.00 | 16.38 | 16.38 | 1 | 18.89 | 6 | kamkkxfxxuwbdslkwifmmcsiusiuosws | 13.800 | 0 |
2 | 764c75f661154dac3a6c254cd082ea7d | foosdfpfkusacimwkcsosbicdxkicaua | 544 | 0 | 0 | 2010-04-16 | 2016-04-16 | 2010-04-16 | 2015-04-17 | 47.96 | 0 | 0.0 | 38.72 | 0.165794 | 0.087899 | 44.311378 | f | 0.00 | 28.60 | 28.60 | 1 | 6.60 | 6 | kamkkxfxxuwbdslkwifmmcsiusiuosws | 13.856 | 0 |
3 | bba03439a292a1e166f80264c16191cb | lmkebamcaaclubfxadlmueccxoimlema | 1584 | 0 | 0 | 2010-03-30 | 2016-03-30 | 2010-03-30 | 2015-03-31 | 240.04 | 0 | 0.0 | 19.83 | 0.146694 | 0.000000 | 44.311378 | f | 0.00 | 30.22 | 30.22 | 1 | 25.46 | 6 | kamkkxfxxuwbdslkwifmmcsiusiuosws | 13.200 | 0 |
4 | 149d57cf92fc41cf94415803a877cb4b | MISSING | 4425 | 0 | 526 | 2010-01-13 | 2016-03-07 | 2010-01-13 | 2015-03-09 | 445.75 | 526 | 0.0 | 131.73 | 0.116900 | 0.100015 | 40.606701 | f | 52.32 | 44.91 | 44.91 | 1 | 47.98 | 6 | kamkkxfxxuwbdslkwifmmcsiusiuosws | 19.800 | 0 |
# number of rows
client['id'].count().compute()
14606
client[client['id'] == '038af19179925da21a25619c5a24b745'].compute()
id | channel_sales | cons_12m | cons_gas_12m | cons_last_month | date_activ | date_end | date_modif_prod | date_renewal | forecast_cons_12m | ... | has_gas | imp_cons | margin_gross_pow_ele | margin_net_pow_ele | nb_prod_act | net_margin | num_years_antig | origin_up | pow_max | churn | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
11047 | 038af19179925da21a25619c5a24b745 | foosdfpfkusacimwkcsosbicdxkicaua | 3576 | 0 | 630 | 2013-06-03 | 2016-06-03 | 2015-04-29 | 2015-06-14 | 531.5 | ... | f | 91.82 | 21.52 | 21.52 | 1 | 52.53 | 3 | ldkssxwpmemidmecebumciepifcamkci | 13.2 | 0 |
1 rows × 26 columns
2.2 | Electricity Prices Data¶
The dataset contains, for each of PowerCo's clients, the price variation for each month
- id: client company identifier
- price_date: reference date
- price_off_peak_var: price of energy for the 1st period (off peak)
- price_peak_var: price of energy for the 2nd period (peak)
- price_mid_peak_var: price of energy for the 3rd period (mid peak)
- price_off_peak_fix: price of power for the 1st period (off peak)
- price_peak_fix: price of power for the 2nd period (peak)
- price_mid_peak_fix: price of power for the 3rd period (mid peak)
This data will allow us to check the hypothesis that was asked of us; we'll do that in a later section
prices = dd.read_csv(prices_path,sep=',',blocksize="30M")
prices.head()
id | price_date | price_off_peak_var | price_peak_var | price_mid_peak_var | price_off_peak_fix | price_peak_fix | price_mid_peak_fix | |
---|---|---|---|---|---|---|---|---|
0 | 038af19179925da21a25619c5a24b745 | 2015-01-01 | 0.151367 | 0.0 | 0.0 | 44.266931 | 0.0 | 0.0 |
1 | 038af19179925da21a25619c5a24b745 | 2015-02-01 | 0.151367 | 0.0 | 0.0 | 44.266931 | 0.0 | 0.0 |
2 | 038af19179925da21a25619c5a24b745 | 2015-03-01 | 0.151367 | 0.0 | 0.0 | 44.266931 | 0.0 | 0.0 |
3 | 038af19179925da21a25619c5a24b745 | 2015-04-01 | 0.149626 | 0.0 | 0.0 | 44.266931 | 0.0 | 0.0 |
4 | 038af19179925da21a25619c5a24b745 | 2015-05-01 | 0.149626 | 0.0 | 0.0 | 44.266931 | 0.0 | 0.0 |
prices['id'].count().compute()
193002
Looks like we have monthly pricing data covering the year 2015; 193,002 rows works out to roughly 12 monthly observations per company
# prices for each customer by month
prices[prices['id'] == '038af19179925da21a25619c5a24b745'].compute()
id | price_date | price_off_peak_var | price_peak_var | price_mid_peak_var | price_off_peak_fix | price_peak_fix | price_mid_peak_fix | |
---|---|---|---|---|---|---|---|---|
0 | 038af19179925da21a25619c5a24b745 | 2015-01-01 | 0.151367 | 0.0 | 0.0 | 44.266931 | 0.0 | 0.0 |
1 | 038af19179925da21a25619c5a24b745 | 2015-02-01 | 0.151367 | 0.0 | 0.0 | 44.266931 | 0.0 | 0.0 |
2 | 038af19179925da21a25619c5a24b745 | 2015-03-01 | 0.151367 | 0.0 | 0.0 | 44.266931 | 0.0 | 0.0 |
3 | 038af19179925da21a25619c5a24b745 | 2015-04-01 | 0.149626 | 0.0 | 0.0 | 44.266931 | 0.0 | 0.0 |
4 | 038af19179925da21a25619c5a24b745 | 2015-05-01 | 0.149626 | 0.0 | 0.0 | 44.266931 | 0.0 | 0.0 |
5 | 038af19179925da21a25619c5a24b745 | 2015-06-01 | 0.149626 | 0.0 | 0.0 | 44.266930 | 0.0 | 0.0 |
6 | 038af19179925da21a25619c5a24b745 | 2015-07-01 | 0.150321 | 0.0 | 0.0 | 44.444710 | 0.0 | 0.0 |
7 | 038af19179925da21a25619c5a24b745 | 2015-08-01 | 0.145859 | 0.0 | 0.0 | 44.444710 | 0.0 | 0.0 |
8 | 038af19179925da21a25619c5a24b745 | 2015-09-01 | 0.145859 | 0.0 | 0.0 | 44.444710 | 0.0 | 0.0 |
9 | 038af19179925da21a25619c5a24b745 | 2015-10-01 | 0.145859 | 0.0 | 0.0 | 44.444710 | 0.0 | 0.0 |
10 | 038af19179925da21a25619c5a24b745 | 2015-11-01 | 0.145859 | 0.0 | 0.0 | 44.444710 | 0.0 | 0.0 |
11 | 038af19179925da21a25619c5a24b745 | 2015-12-01 | 0.145859 | 0.0 | 0.0 | 44.444710 | 0.0 | 0.0 |
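Since the later price-sensitivity features hinge on monthly movements, one quick sanity check worth running is that each company really contributes one row per 2015 month. A minimal pandas sketch on a toy frame (the real check would group `price_data` by `id` in the same way):

```python
import pandas as pd

# toy stand-in for the prices table: two companies, one row per 2015 month
toy = pd.DataFrame({
    'id': ['a'] * 12 + ['b'] * 12,
    'price_date': list(pd.date_range('2015-01-01', periods=12, freq='MS')) * 2,
})

# each company should contribute exactly 12 distinct monthly dates
months_per_id = toy.groupby('id')['price_date'].nunique()
print((months_per_id == 12).all())
```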
3. Descriptive Statistics¶
Let's look at some basic univariate statistics and some more in-depth relations we can find in the data
3.1 | What we'll do¶
- We first want to do some surface-level statistical investigation
- Ultimately, we'll need to answer questions regarding what influences customer churn
- First, let's check the percentage of clients who have churned over the next 3 months
(client['churn'].value_counts(normalize=True)*100).compute()
churn
0    90.284814
1     9.715186
Name: proportion, dtype: float64
3.2 | Customer data¶
client.describe().compute().round(3)
cons_12m | cons_gas_12m | cons_last_month | forecast_cons_12m | forecast_cons_year | forecast_discount_energy | forecast_meter_rent_12m | forecast_price_energy_off_peak | forecast_price_energy_peak | forecast_price_pow_off_peak | imp_cons | margin_gross_pow_ele | margin_net_pow_ele | nb_prod_act | net_margin | num_years_antig | pow_max | churn | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 14606.000 | 14606.000 | 14606.000 | 14606.000 | 14606.000 | 14606.000 | 14606.000 | 14606.000 | 14606.000 | 14606.000 | 14606.000 | 14606.000 | 14606.000 | 14606.000 | 14606.000 | 14606.000 | 14606.000 | 14606.000 |
mean | 159220.286 | 28092.375 | 16090.270 | 1868.615 | 1399.763 | 0.967 | 63.087 | 0.137 | 0.050 | 43.130 | 152.787 | 24.565 | 24.563 | 1.292 | 189.265 | 4.998 | 18.135 | 0.097 |
std | 573465.264 | 162973.059 | 64364.196 | 2387.572 | 3247.786 | 5.108 | 66.166 | 0.025 | 0.049 | 4.486 | 341.369 | 20.231 | 20.230 | 0.710 | 311.798 | 1.612 | 13.535 | 0.296 |
min | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 1.000 | 3.300 | 0.000 |
25% | 5674.750 | 0.000 | 0.000 | 494.995 | 0.000 | 0.000 | 16.180 | 0.116 | 0.000 | 40.607 | 0.000 | 14.280 | 14.280 | 1.000 | 50.712 | 4.000 | 12.500 | 0.000 |
50% | 14115.500 | 0.000 | 792.500 | 1112.875 | 314.000 | 0.000 | 18.795 | 0.143 | 0.084 | 44.311 | 37.395 | 21.640 | 21.640 | 1.000 | 112.530 | 5.000 | 13.856 | 0.000 |
75% | 40763.750 | 0.000 | 3383.000 | 2401.790 | 1745.750 | 0.000 | 131.030 | 0.146 | 0.099 | 44.311 | 193.980 | 29.880 | 29.880 | 1.000 | 243.098 | 6.000 | 19.172 | 0.000 |
max | 6207104.000 | 4154590.000 | 771203.000 | 82902.830 | 175375.000 | 30.000 | 599.310 | 0.274 | 0.196 | 59.266 | 15042.790 | 374.640 | 374.640 | 32.000 | 24570.650 | 13.000 | 320.000 | 1.000 |
- The proportion of churned clients is quite low: 9.72%
- Nevertheless, our client has raised concern, so we need to investigate what affects this churn
(client['churn'].value_counts(normalize=True).compute() * 100).round(2)
churn
0    90.28
1     9.72
Name: proportion, dtype: float64
3.3 | Price Data¶
Let's look at the statistics for each of the pricing columns across all customers
prices.describe().compute().round(3).T
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
price_off_peak_var | 193002.0 | 0.141 | 0.025 | 0.0 | 0.126 | 0.146 | 0.152 | 0.281 |
price_peak_var | 193002.0 | 0.055 | 0.050 | 0.0 | 0.000 | 0.085 | 0.102 | 0.230 |
price_mid_peak_var | 193002.0 | 0.030 | 0.036 | 0.0 | 0.000 | 0.000 | 0.073 | 0.114 |
price_off_peak_fix | 193002.0 | 43.334 | 5.410 | 0.0 | 40.729 | 44.267 | 44.445 | 59.445 |
price_peak_fix | 193002.0 | 10.623 | 12.842 | 0.0 | 0.000 | 0.000 | 24.340 | 36.491 |
price_mid_peak_fix | 193002.0 | 6.410 | 7.774 | 0.0 | 0.000 | 0.000 | 16.226 | 17.458 |
4. What influences churn¶
Let's investigate which factors affect churn, and relay them via statistics and visualisations
- We'll limit our investigation to customer features only
4.1 | Sales Channel¶
- Sales channel doesn't appear to have a simple linear relationship with churn
- Some channels account for more churn than others; notably, the group foosdfpfkusacimwkcsosbicdxkicaua makes up over half of the churns
churned = client[client['churn'] == 1]
notchurn = client[client['churn'] == 0]
Let's look at the counts relative to the total number of churned clients
churn_counts = churned.groupby(['channel_sales']).agg(counts=('id','count')).compute()
churn_counts['count%'] = ((churn_counts['counts']/churn_counts['counts'].sum()) * 100).round(2)
churn_counts = churn_counts.reset_index()
churn_counts.sort_values(by='count%',ascending=False)
channel_sales | counts | count% | |
---|---|---|---|
0 | foosdfpfkusacimwkcsosbicdxkicaua | 820 | 57.79 |
2 | MISSING | 283 | 19.94 |
1 | usilxuppasemubllopkaafesmlibmsdf | 138 | 9.73 |
3 | lmkebamcaaclubfxadlmueccxoimlema | 103 | 7.26 |
4 | ewpakwlliwisiwduibdlfmalxowmwpci | 75 | 5.29 |
- Now let's look at churn relative to each group in channel_sales
- Let's also create a helper function which lets us compare, percentage-wise, churned to non-churned clients within each group
def compare_percentage(column:str, df:pd.DataFrame=client):
    # count churned / not-churned clients per category of `column`
    churn_counts = df.groupby([column,'churn']).agg(counts=('id','count')).compute()
    churn_counts = churn_counts.reset_index()
    # total clients per category, used as the denominator
    all_counts = client.groupby([column]).agg(counts=('id','count')).compute()
    merged = churn_counts.merge(all_counts, left_on=column, right_index=True)
    merged['percent%'] = ((merged['counts_x']/merged['counts_y'])*100).round(2)
    merged = merged.groupby([column,'churn'])['percent%'].mean().to_frame().sort_index()
    return merged
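As a cross-check, the same per-group percentages can be produced in one line with `pd.crosstab` using `normalize='index'`; a toy sketch with hypothetical channel labels:

```python
import pandas as pd

# toy client table (channels 'A'/'B' are hypothetical): churn differs by channel
toy = pd.DataFrame({
    'channel_sales': ['A', 'A', 'A', 'A', 'B', 'B'],
    'churn':         [ 0,   0,   0,   1,   0,   0 ],
})

# row-normalised crosstab: share of churned vs retained within each channel
pct = (pd.crosstab(toy['channel_sales'], toy['churn'], normalize='index') * 100).round(2)
```

Here `pct.loc['A', 1]` gives the churn percentage within channel A, matching what `compare_percentage` computes per group.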
compare_percentage('channel_sales')
percent% | ||
---|---|---|
channel_sales | churn | |
MISSING | 0 | 92.40 |
1 | 7.60 | |
epumfxlbckeskwekxbiuasklxalciiuu | 0 | 100.00 |
ewpakwlliwisiwduibdlfmalxowmwpci | 0 | 91.60 |
1 | 8.40 | |
fixdbufsefwooaasfcxdxadsiekoceaa | 0 | 100.00 |
foosdfpfkusacimwkcsosbicdxkicaua | 0 | 87.86 |
1 | 12.14 | |
lmkebamcaaclubfxadlmueccxoimlema | 0 | 94.41 |
1 | 5.59 | |
sddiedcslfslkckwlfkdpoeeailfpeds | 0 | 100.00 |
usilxuppasemubllopkaafesmlibmsdf | 0 | 89.96 |
1 | 10.04 |
4.2 | Gas Active Clients¶
- One other feature we should look at specifies whether the client is also a gas client
- If clients are also gas customers, the likelihood of them churning is substantially smaller
- This would definitely be a point to raise when trying to prevent current users from switching to other companies
gas_counts = churned.groupby(['has_gas']).agg(counts=('id','count')).compute()
gas_counts['count%'] = ((gas_counts['counts']/gas_counts['counts'].sum()) * 100).round(2)
gas_counts.sort_values(by='count%',ascending=False)
counts | count% | |
---|---|---|
has_gas | ||
f | 1202 | 84.71 |
t | 217 | 15.29 |
compare_percentage('has_gas')
percent% | ||
---|---|---|
has_gas | churn | |
f | 0 | 89.95 |
1 | 10.05 | |
t | 0 | 91.81 |
1 | 8.19 |
4.3 | Correlation to Churn¶
- We can also compare how the number of existing customer products, amongst other numeric columns, correlates with churn
- Logically, the higher the number of products a customer has, the lower the chances of them churning
import warnings; warnings.filterwarnings('ignore')
client.select_dtypes(include=['float64','int64']).corr().round(2).compute().loc['churn'].to_frame()
churn | |
---|---|
cons_12m | -0.05 |
cons_gas_12m | -0.04 |
cons_last_month | -0.05 |
forecast_cons_12m | 0.01 |
forecast_cons_year | -0.00 |
forecast_discount_energy | 0.02 |
forecast_meter_rent_12m | 0.04 |
forecast_price_energy_off_peak | -0.01 |
forecast_price_energy_peak | 0.03 |
forecast_price_pow_off_peak | 0.01 |
imp_cons | -0.00 |
margin_gross_pow_ele | 0.10 |
margin_net_pow_ele | 0.10 |
nb_prod_act | -0.01 |
net_margin | 0.04 |
num_years_antig | -0.07 |
pow_max | 0.03 |
churn | 1.00 |
- From the results we can see that most numeric features are only very mildly linearly correlated with churn
- This metric doesn't give us enough detail about the effect of the number of products
4.4 | Number of Products¶
Let's check in more detail whether there is indeed any relation to churn
- We can see that beyond 5 products, the churn percentage is 0
- At 5 products or fewer, the churn rate is roughly the same (around 10%)
client['nb_prod_act_cat'] = client['nb_prod_act'].astype('category')
compare_percentage('nb_prod_act_cat')
percent% | ||
---|---|---|
nb_prod_act_cat | churn | |
1 | 0 | 90.02 |
1 | 9.98 | |
2 | 0 | 91.49 |
1 | 8.51 | |
3 | 0 | 90.06 |
1 | 9.94 | |
4 | 0 | 90.00 |
1 | 10.00 | |
5 | 0 | 90.32 |
1 | 9.68 | |
6 | 0 | 100.00 |
1 | 0.00 | |
8 | 0 | 100.00 |
1 | 0.00 | |
9 | 0 | 100.00 |
1 | 0.00 | |
10 | 0 | 100.00 |
1 | 0.00 | |
32 | 0 | 100.00 |
1 | 0.00 |
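Since churn is a 0/1 indicator, the churn rate per product count can also be read off directly as a group mean; a toy sketch with hypothetical values:

```python
import pandas as pd

# toy data: product counts and a 0/1 churn flag (values hypothetical)
toy = pd.DataFrame({
    'nb_prod_act': [1, 1, 1, 1, 2, 2, 6, 6],
    'churn':       [0, 1, 0, 1, 0, 1, 0, 0],
})

# the mean of a 0/1 indicator per group is exactly the churn rate
churn_rate = (toy.groupby('nb_prod_act')['churn'].mean() * 100).round(2)
```

This is a compact alternative to tabulating both churn classes when only the churned percentage is of interest.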
5. Hypothesis Investigation¶
We looked at various factors from the customer dataset that can affect churn. Let's now examine the pricing for each customer and check the customers' sensitivity to pricing.
5.1 | Standard Deviations¶
Since we have the pricing data for each of the companies for the year 2015, we will create new features to measure price sensitivity
prices.head()
id | price_date | price_off_peak_var | price_peak_var | price_mid_peak_var | price_off_peak_fix | price_peak_fix | price_mid_peak_fix | |
---|---|---|---|---|---|---|---|---|
0 | 038af19179925da21a25619c5a24b745 | 2015-01-01 | 0.151367 | 0.0 | 0.0 | 44.266931 | 0.0 | 0.0 |
1 | 038af19179925da21a25619c5a24b745 | 2015-02-01 | 0.151367 | 0.0 | 0.0 | 44.266931 | 0.0 | 0.0 |
2 | 038af19179925da21a25619c5a24b745 | 2015-03-01 | 0.151367 | 0.0 | 0.0 | 44.266931 | 0.0 | 0.0 |
3 | 038af19179925da21a25619c5a24b745 | 2015-04-01 | 0.149626 | 0.0 | 0.0 | 44.266931 | 0.0 | 0.0 |
4 | 038af19179925da21a25619c5a24b745 | 2015-05-01 | 0.149626 | 0.0 | 0.0 | 44.266931 | 0.0 | 0.0 |
Let's aggregate the standard deviation of each price column per customer; the standard deviation is an indicator of price volatility
cols = ['price_off_peak_var','price_peak_var','price_mid_peak_var',
'price_off_peak_fix','price_peak_fix','price_mid_peak_fix']
cols_diff = ['diff_price_off_peak_var','diff_price_peak_var','diff_price_mid_peak_var',
'diff_price_off_peak_fix','diff_price_peak_fix','diff_price_mid_peak_fix']
cols_diff_cumsum = ['cdiff_price_off_peak_var','cdiff_price_peak_var','cdiff_price_mid_peak_var',
'cdiff_price_off_peak_fix','cdiff_price_peak_fix','cdiff_price_mid_peak_fix']
std_prices = prices.groupby('id')[cols].std().compute()
client_churn = client[['id','churn']].copy()
client_churn_stats = client_churn.merge(std_prices,left_on='id',right_index=True).compute()
client_churn_stats.index = client_churn_stats['id']
client_churn_stats = client_churn_stats.drop(['id'],axis=1)
client_churn_stats.corr().round(2).loc['churn'].to_frame()
churn | |
---|---|
churn | 1.00 |
price_off_peak_var | 0.04 |
price_peak_var | 0.02 |
price_mid_peak_var | 0.02 |
price_off_peak_fix | 0.02 |
price_peak_fix | 0.02 |
price_mid_peak_fix | 0.01 |
5.2 | Price Hike Counters¶
We can also take the difference between consecutive months' values and count how many times the prices went up, for each metric and for each of PowerCo's customers
# note: .diff() runs over the full column, so each customer's first
# difference is taken against the previous customer's last row
for col_diff,col in zip(cols_diff,cols):
    prices[col_diff] = prices[col].diff()
# count the number of times there have been price hikes
diff_price_off_peak_var_counter = prices.groupby('id').apply(lambda x: (x['diff_price_off_peak_var'] > 0).sum()).compute()
diff_price_peak_var_counter = prices.groupby('id').apply(lambda x: (x['diff_price_peak_var'] > 0).sum()).compute()
diff_price_mid_peak_var_counter = prices.groupby('id').apply(lambda x: (x['diff_price_mid_peak_var'] > 0).sum()).compute()
diff_price_off_peak_fix_counter = prices.groupby('id').apply(lambda x: (x['diff_price_off_peak_fix'] > 0).sum()).compute()
diff_price_peak_fix_counter = prices.groupby('id').apply(lambda x: (x['diff_price_peak_fix'] > 0).sum()).compute()
diff_price_mid_peak_fix_counter = prices.groupby('id').apply(lambda x: (x['diff_price_mid_peak_fix'] > 0).sum()).compute()
diff_price_mid_peak_fix_counter
id
0002203ffbb812588b632b9e628cc38d    2
0004351ebdd665e6ee664792efc4fd13    0
0010bcc39e42b3c2131ed2ce55246e3c    0
0010ee3855fdea87602a5b7aba8e42de    2
00114d74e963e47177db89bc70108537    0
                                   ..
ffef185810e44254c3a4c6395e6b4d8a    2
fffac626da707b1b5ab11e8431a4d0a2    0
fffc0cacd305dd51f316424bbb08d1bd    2
fffe4f5646aa39c7f97f95ae2679ce64    2
ffff7fa066f1fb305ae285bb03bf325a    2
Length: 16096, dtype: int64
price_hikes = pd.concat([diff_price_off_peak_var_counter,diff_price_peak_var_counter,diff_price_mid_peak_var_counter,
diff_price_off_peak_fix_counter,diff_price_peak_fix_counter,diff_price_mid_peak_fix_counter],axis=1)
price_hikes.columns = cols_diff
price_hikes
diff_price_off_peak_var | diff_price_peak_var | diff_price_mid_peak_var | diff_price_off_peak_fix | diff_price_peak_fix | diff_price_mid_peak_fix | |
---|---|---|---|---|---|---|
id | ||||||
0002203ffbb812588b632b9e628cc38d | 1 | 2 | 2 | 1 | 2 | 2 |
0004351ebdd665e6ee664792efc4fd13 | 2 | 0 | 0 | 2 | 0 | 0 |
0010bcc39e42b3c2131ed2ce55246e3c | 3 | 0 | 0 | 1 | 0 | 0 |
0010ee3855fdea87602a5b7aba8e42de | 0 | 1 | 2 | 2 | 2 | 2 |
00114d74e963e47177db89bc70108537 | 2 | 0 | 0 | 1 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... |
ffef185810e44254c3a4c6395e6b4d8a | 2 | 2 | 2 | 1 | 2 | 2 |
fffac626da707b1b5ab11e8431a4d0a2 | 2 | 0 | 0 | 1 | 0 | 0 |
fffc0cacd305dd51f316424bbb08d1bd | 3 | 2 | 2 | 1 | 2 | 2 |
fffe4f5646aa39c7f97f95ae2679ce64 | 0 | 1 | 2 | 2 | 2 | 2 |
ffff7fa066f1fb305ae285bb03bf325a | 0 | 1 | 2 | 2 | 2 | 2 |
16096 rows × 6 columns
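One caveat with the counters above: `.diff()` was applied over the full column, so each customer's first row is differenced against the previous customer's last row. A per-group diff avoids that contamination; a toy pandas sketch (prices hypothetical):

```python
import pandas as pd

# toy prices: two customers, three monthly observations each
toy = pd.DataFrame({
    'id':    ['a', 'a', 'a', 'b', 'b', 'b'],
    'price': [0.10, 0.12, 0.12, 0.04, 0.05, 0.06],
})

# diff within each id, then count strictly positive month-on-month moves
hikes = (
    toy.groupby('id')['price']
       .diff()                 # NaN at each customer's first row, never a cross-customer diff
       .gt(0)
       .groupby(toy['id'])
       .sum()
)
```

Customer `a` has one hike (0.10 → 0.12) and customer `b` has two; with a plain column-wide `.diff()`, `b`'s first row would instead be compared to `a`'s last price.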
client_churn = client[['id','churn']].copy()
client_churn_stats = client_churn.merge(price_hikes,left_on='id',right_index=True).compute()
client_churn_stats['sum'] = client_churn_stats[cols_diff].sum(axis=1)  # total number of hikes across all six price columns
client_churn_stats.index = client_churn_stats['id']
client_churn_stats = client_churn_stats.drop(['id'],axis=1)
client_churn_stats.corr().round(2).loc['churn'].to_frame()
churn | |
---|---|
churn | 1.00 |
diff_price_off_peak_var | -0.01 |
diff_price_peak_var | 0.02 |
diff_price_mid_peak_var | 0.05 |
diff_price_off_peak_fix | 0.00 |
diff_price_peak_fix | 0.04 |
diff_price_mid_peak_fix | 0.04 |
sum | 0.03 |
5.3 | Price Accumulation¶
Lastly, let's check how the accumulated price changes relate to churn
for cdiff,diff in zip(cols_diff_cumsum,cols_diff):
prices[cdiff] = prices[diff].cumsum()
prices.columns
Index(['id', 'price_date', 'price_off_peak_var', 'price_peak_var',
       'price_mid_peak_var', 'price_off_peak_fix', 'price_peak_fix',
       'price_mid_peak_fix', 'diff_price_off_peak_var', 'diff_price_peak_var',
       'diff_price_mid_peak_var', 'diff_price_off_peak_fix',
       'diff_price_peak_fix', 'diff_price_mid_peak_fix',
       'cdiff_price_off_peak_var', 'cdiff_price_peak_var',
       'cdiff_price_mid_peak_var', 'cdiff_price_off_peak_fix',
       'cdiff_price_peak_fix', 'cdiff_price_mid_peak_fix'],
      dtype='object')
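Worth noting: a cumsum of a diff telescopes, so each cdiff value equals the current price minus the value in the file's very first row, not the customer's own first month. The per-customer accumulated change over the year is simply the last price minus the first; a toy sketch (prices hypothetical):

```python
import pandas as pd

# toy prices for two customers across the year
toy = pd.DataFrame({
    'id':    ['a', 'a', 'a', 'b', 'b'],
    'price': [0.10, 0.12, 0.11, 0.05, 0.07],
})

# net accumulated change per customer: last observed price minus the first
net_change = toy.groupby('id')['price'].agg(lambda s: s.iloc[-1] - s.iloc[0])
```

This assumes rows are already sorted by date within each id, as the prices file appears to be.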
Accumulated difference in price variations for each customer
pdprices = prices.compute()
client_churn = client[['id','churn']].compute().copy()
pdprices
id | price_date | price_off_peak_var | price_peak_var | price_mid_peak_var | price_off_peak_fix | price_peak_fix | price_mid_peak_fix | diff_price_off_peak_var | diff_price_peak_var | diff_price_mid_peak_var | diff_price_off_peak_fix | diff_price_peak_fix | diff_price_mid_peak_fix | cdiff_price_off_peak_var | cdiff_price_peak_var | cdiff_price_mid_peak_var | cdiff_price_off_peak_fix | cdiff_price_peak_fix | cdiff_price_mid_peak_fix | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 038af19179925da21a25619c5a24b745 | 2015-01-01 | 0.151367 | 0.000000 | 0.000000 | 44.266931 | 0.00000 | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 038af19179925da21a25619c5a24b745 | 2015-02-01 | 0.151367 | 0.000000 | 0.000000 | 44.266931 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 |
2 | 038af19179925da21a25619c5a24b745 | 2015-03-01 | 0.151367 | 0.000000 | 0.000000 | 44.266931 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 |
3 | 038af19179925da21a25619c5a24b745 | 2015-04-01 | 0.149626 | 0.000000 | 0.000000 | 44.266931 | 0.00000 | 0.000000 | -0.001741 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | -0.001741 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 |
4 | 038af19179925da21a25619c5a24b745 | 2015-05-01 | 0.149626 | 0.000000 | 0.000000 | 44.266931 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | -0.001741 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
192997 | 16f51cdc2baa19af0b940ee1b3dd17d5 | 2015-08-01 | 0.119916 | 0.102232 | 0.076257 | 40.728885 | 24.43733 | 16.291555 | -0.008161 | -0.004169 | -0.000054 | 0.0 | 0.0 | 0.0 | -0.031451 | 0.102232 | 0.076257 | -3.538046 | 24.43733 | 16.291555 |
192998 | 16f51cdc2baa19af0b940ee1b3dd17d5 | 2015-09-01 | 0.119916 | 0.102232 | 0.076257 | 40.728885 | 24.43733 | 16.291555 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | -0.031451 | 0.102232 | 0.076257 | -3.538046 | 24.43733 | 16.291555 |
192999 | 16f51cdc2baa19af0b940ee1b3dd17d5 | 2015-10-01 | 0.119916 | 0.102232 | 0.076257 | 40.728885 | 24.43733 | 16.291555 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | -0.031451 | 0.102232 | 0.076257 | -3.538046 | 24.43733 | 16.291555 |
193000 | 16f51cdc2baa19af0b940ee1b3dd17d5 | 2015-11-01 | 0.119916 | 0.102232 | 0.076257 | 40.728885 | 24.43733 | 16.291555 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | -0.031451 | 0.102232 | 0.076257 | -3.538046 | 24.43733 | 16.291555 |
193001 | 16f51cdc2baa19af0b940ee1b3dd17d5 | 2015-12-01 | 0.119916 | 0.102232 | 0.076257 | 40.728885 | 24.43733 | 16.291555 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | -0.031451 | 0.102232 | 0.076257 | -3.538046 | 24.43733 | 16.291555 |
193002 rows × 20 columns
diff_price_off_peak_var_cumsum = pdprices.groupby('id',as_index=True)[['cdiff_price_off_peak_var']].last()
diff_price_peak_var_cumsum = pdprices.groupby('id',as_index=True)[['cdiff_price_peak_var']].last()
diff_price_mid_peak_var_cumsum = pdprices.groupby('id',as_index=True)[['cdiff_price_mid_peak_var']].last()
diff_price_off_peak_fix_cumsum = pdprices.groupby('id',as_index=True)[['cdiff_price_off_peak_fix']].last()
diff_price_peak_fix_cumsum = pdprices.groupby('id',as_index=True)[['cdiff_price_peak_fix']].last()
diff_price_mid_peak_fix_cumsum = pdprices.groupby('id',as_index=True)[['cdiff_price_mid_peak_fix']].last()
price_accumulations = pd.concat([diff_price_off_peak_var_cumsum,diff_price_peak_var_cumsum,diff_price_mid_peak_var_cumsum,
diff_price_off_peak_fix_cumsum,diff_price_peak_fix_cumsum,diff_price_mid_peak_fix_cumsum],axis=1)
price_accumulations
# price_accumulations.corr().round(2).loc['churn'].to_frame()
cdiff_price_off_peak_var | cdiff_price_peak_var | cdiff_price_mid_peak_var | cdiff_price_off_peak_fix | cdiff_price_peak_fix | cdiff_price_mid_peak_fix | |
---|---|---|---|---|---|---|
id | ||||||
0002203ffbb812588b632b9e628cc38d | -0.031461 | 0.101673 | 0.073719 | -3.538046 | 24.43733 | 16.291555 |
0004351ebdd665e6ee664792efc4fd13 | -0.007424 | 0.000000 | 0.000000 | 0.177779 | 0.00000 | 0.000000 |
0010bcc39e42b3c2131ed2ce55246e3c | 0.049913 | 0.000000 | 0.000000 | 1.677779 | 0.00000 | 0.000000 |
0010ee3855fdea87602a5b7aba8e42de | -0.038299 | 0.095385 | 0.069409 | -3.538046 | 24.43733 | 16.291555 |
00114d74e963e47177db89bc70108537 | -0.005927 | 0.000000 | 0.000000 | -0.000001 | 0.00000 | 0.000000 |
... | ... | ... | ... | ... | ... | ... |
ffef185810e44254c3a4c6395e6b4d8a | -0.038879 | 0.094804 | 0.068829 | -3.538046 | 24.43733 | 16.291555 |
fffac626da707b1b5ab11e8431a4d0a2 | -0.006320 | 0.000000 | 0.000000 | 0.177779 | 0.00000 | 0.000000 |
fffc0cacd305dd51f316424bbb08d1bd | 0.000032 | 0.126871 | 0.091394 | -3.038046 | 24.93733 | 16.791555 |
fffe4f5646aa39c7f97f95ae2679ce64 | -0.033192 | 0.100491 | 0.074516 | -3.538046 | 24.43733 | 16.291555 |
ffff7fa066f1fb305ae285bb03bf325a | -0.031451 | 0.102232 | 0.076257 | -3.538046 | 24.43733 | 16.291555 |
16096 rows × 6 columns
client_churn_stats = client_churn.merge(price_accumulations,left_on='id',right_index=True)
client_churn_stats = client_churn_stats.drop(['id'],axis=1)
client_churn_stats
churn | cdiff_price_off_peak_var | cdiff_price_peak_var | cdiff_price_mid_peak_var | cdiff_price_off_peak_fix | cdiff_price_peak_fix | cdiff_price_mid_peak_fix | |
---|---|---|---|---|---|---|---|
0 | 1 | -0.005334 | 0.085483 | 0.000000 | -0.000001 | 0.000000 | 0.000000 |
1 | 0 | -0.003767 | 0.000000 | 0.000000 | 0.177779 | 0.000000 | 0.000000 |
2 | 0 | 0.016431 | 0.088409 | 0.000000 | 0.177779 | 0.000000 | 0.000000 |
3 | 0 | -0.002781 | 0.000000 | 0.000000 | 0.177779 | 0.000000 | 0.000000 |
4 | 0 | -0.031461 | 0.101673 | 0.073719 | -3.538046 | 24.437330 | 16.291555 |
... | ... | ... | ... | ... | ... | ... | ... |
14601 | 0 | -0.011195 | 0.000000 | 0.000000 | 0.177779 | 0.000000 | 0.000000 |
14602 | 1 | -0.049015 | 0.093181 | 0.070990 | 14.906537 | 36.490689 | 8.367731 |
14603 | 1 | -0.031461 | 0.101673 | 0.073719 | -3.538046 | 24.437330 | 16.291555 |
14604 | 0 | -0.003767 | 0.000000 | 0.000000 | 0.177779 | 0.000000 | 0.000000 |
14605 | 0 | 0.014595 | 0.086905 | 0.000000 | -0.000001 | 0.000000 | 0.000000 |
14606 rows × 7 columns
client_churn_stats.corr().round(2).loc['churn'].to_frame()
churn | |
---|---|
churn | 1.00 |
cdiff_price_off_peak_var | -0.01 |
cdiff_price_peak_var | 0.03 |
cdiff_price_mid_peak_var | 0.05 |
cdiff_price_off_peak_fix | 0.02 |
cdiff_price_peak_fix | 0.05 |
cdiff_price_mid_peak_fix | 0.04 |
6. Takeaways¶
Having been introduced to our new client in the first part:
- Coming out of the first meeting, we need to test a hypothesis raised during it: "how much is price a factor in a customer's choice?"
- We took a first look at the data presented to us by our client PowerCo, exploring both the data about PowerCo's customers and their prices for the entire year 2015
- The client has expressed concern about customers leaving; we found that only 9.7% of customers have churned. A small portion, but enough for the client to raise concern
Churn Investigation¶
Feature relations
- The sales channel (channel_sales) showed that about 58% of churned users came from the subgroup foosdfpfkusacimwkcsosbicdxkicaua; within that subgroup itself, roughly 12% of users churned
- 85% of the users who churned didn't have a gas contract (has_gas) with the same supplier
- num_years_antig shows that churn tends to drop the longer a customer has been a client; likewise, the gross and net margins on power subscription (margin_gross_pow_ele, margin_net_pow_ele) showed some of the strongest linear correlations with churn (0.10), i.e. the higher these margins, the more likely a churn outcome
- Overall, the linear correlations of the features with churn tended to be quite weak
- The number of products (nb_prod_act) held with our client PowerCo was more insightful: customers with more than 5 products showed no churn at all
Hypothesis check for pricing
To check the hypothesis that was raised, we derived a few extra features from the pricing data:
- The standard deviations of the 2015 prices showed only very small positive correlations with churn, weakly suggesting that less stable pricing comes with a slightly higher tendency to churn
- Price hikes, for which we counted the number of times prices increased over the year, likewise showed small positive correlations: customers tended to churn slightly more often the more their prices were raised
- Lastly, we checked the accumulated price change over the entire year; the correlations didn't change much either, peaking at a linear correlation of 0.05
These factors suggest that pricing and its volatility aren't the most critical drivers of a customer's decision to churn