import pandas as pd
from pandas import option_context
import dask.dataframe as dd
import os
Data Exploration¶
1. Background¶
1.1 | What is asked of us¶
Notebook Overview
- How to investigate whether price sensitivity is the most influential factor for a customer churning
- Conduct exploratory data analysis
Price sensitivity
is the degree to which demand changes when the cost of a product or service changes
- In the context of PowerCo, the “demand” refers to the demand for energy consumption
- Price sensitivity is commonly measured using the price elasticity of demand, reflecting that some consumers won't pay more if a lower-priced option is available
What is price elasticity of demand?
- Price elasticity of demand is a measurement of the change in consumption of a product in relation to a change in its price
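The definition above can be sketched in a few lines; the numbers here are purely hypothetical and only illustrate the ratio:

```python
def price_elasticity(q0, q1, p0, p1):
    """Price elasticity of demand: % change in quantity demanded
    divided by % change in price (simple, non-midpoint form)."""
    pct_quantity = (q1 - q0) / q0
    pct_price = (p1 - p0) / p0
    return pct_quantity / pct_price

# hypothetical example: price rises 0.10 -> 0.12 (+20%) while
# demand falls 1000 -> 900 units (-10%): elasticity ≈ -0.5 (inelastic)
e = price_elasticity(1000, 900, 0.10, 0.12)
```

An absolute elasticity below 1 means demand changes proportionally less than price, i.e. customers are relatively insensitive to that price move.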
1.2 | What the client is handing over¶
Our client has sent over 3 data sets (shown below):
- Historical customer: customer data such as usage, sign-up date, forecasted usage etc.
- Historical pricing: variable and fixed pricing data etc.
- Churn indicator: whether each customer has churned or not
You need to analyze the following things:
- The data types of each column
- Descriptive statistics of the dataset
- Distributions of columns
1.3 | File Previews¶
# !head -5 price_data.csv
!head -5 client_data.csv
id,channel_sales,cons_12m,cons_gas_12m,cons_last_month,date_activ,date_end,date_modif_prod,date_renewal,forecast_cons_12m,forecast_cons_year,forecast_discount_energy,forecast_meter_rent_12m,forecast_price_energy_off_peak,forecast_price_energy_peak,forecast_price_pow_off_peak,has_gas,imp_cons,margin_gross_pow_ele,margin_net_pow_ele,nb_prod_act,net_margin,num_years_antig,origin_up,pow_max,churn
24011ae4ebbe3035111d65fa7c15bc57,foosdfpfkusacimwkcsosbicdxkicaua,0,54946,0,2013-06-15,2016-06-15,2015-11-01,2015-06-23,0.0,0,0.0,1.78,0.114481,0.098142,40.606701,t,0.0,25.44,25.44,2,678.99,3,lxidpiddsbxsbosboudacockeimpuepw,43.648,1
d29c2c54acc38ff3c0614d0a653813dd,MISSING,4660,0,0,2009-08-21,2016-08-30,2009-08-21,2015-08-31,189.95,0,0.0,16.27,0.1457109999999999,0.0,44.31137796,f,0.0,16.38,16.38,1,18.89,6,kamkkxfxxuwbdslkwifmmcsiusiuosws,13.8,0
764c75f661154dac3a6c254cd082ea7d,foosdfpfkusacimwkcsosbicdxkicaua,544,0,0,2010-04-16,2016-04-16,2010-04-16,2015-04-17,47.96,0,0.0,38.72,0.165794,0.087899,44.31137796,f,0.0,28.6,28.6,1,6.6,6,kamkkxfxxuwbdslkwifmmcsiusiuosws,13.856,0
bba03439a292a1e166f80264c16191cb,lmkebamcaaclubfxadlmueccxoimlema,1584,0,0,2010-03-30,2016-03-30,2010-03-30,2015-03-31,240.04,0,0.0,19.83,0.146694,0.0,44.31137796,f,0.0,30.22,30.22,1,25.46,6,kamkkxfxxuwbdslkwifmmcsiusiuosws,13.2,0
!ls . -l -h
total 23M
-rw-rw-r-- 1 andrey andrey 3.4M Feb  8 17:14  client_data.csv
-rw-rw-r-- 1 andrey andrey  18M Feb  8 17:14  price_data.csv
-rw-rw-r-- 1 andrey andrey  70K Feb  8 18:23 'Task 2 - eda_starter.ipynb'
-rw-rw-r-- 1 andrey andrey 285K Feb 10 00:47  Task2-edit.ipynb
-rw-rw-r-- 1 andrey andrey 287K Feb 16 20:58  Task2.ipynb
-rw-rw-r-- 1 andrey andrey 587K Feb  8 21:36 'Task 2 - Model Answer - EDA.ipynb'
# the CSV files live in the current working directory
clients_path = os.path.join('','client_data.csv')
prices_path = os.path.join('','price_data.csv')
2. Dataset Preview¶
2.1 | Clients Dataset¶
The basic problem setup is a customer churn one: we have information about each unique customer of PowerCo
- activity_new: Category of the company’s activity
- channel_sales: Code of the sales channel
- cons_12m: Electricity consumption of the past 12 months
- cons_gas_12m: Gas consumption of the past 12 months
- cons_last_month: Electricity consumption of the last month
- date_activ: Date of activation of the contract
- date_end: Registered date of the end of the contract
- date_modif_prod: Date of the last modification of the product
- date_renewal: Date of the next contract renewal
- forecast_cons_12m: Forecasted electricity consumption for the next 12 months
- forecast_cons_year: Forecasted electricity consumption for the next calendar year
- forecast_discount_energy: Forecasted value of current discount
- forecast_meter_rent_12m: Forecasted bill of meter rental for the next 12 months
- forecast_price_energy_off_peak: Forecasted energy price for 1st period (off peak)
- forecast_price_energy_peak: Forecasted energy price for 2nd period (peak)
- forecast_price_pow_off_peak: Forecasted power price for 1st period (off peak)
- has_gas: Indicates whether the client is also a gas client
- imp_cons: Current paid consumption
- margin_gross_pow_ele: Gross margin on power subscription
- margin_net_pow_ele: Net margin on power subscription
- nb_prod_act: Number of active products and services
- net_margin: Total net margin
- num_years_antig: Antiquity of the client (in years)
- origin_up: Code of the electricity campaign the customer first subscribed to
- pow_max: Subscribed power
- churn: Has the client churned over the next 3 months
# each row corresponds to a unique customer
client = dd.read_csv(clients_path,sep=',', blocksize="30M")
The column data types are summarised below
client.dtypes
id                                string[pyarrow]
channel_sales                     string[pyarrow]
cons_12m                                    int64
cons_gas_12m                                int64
cons_last_month                             int64
date_activ                        string[pyarrow]
date_end                          string[pyarrow]
date_modif_prod                   string[pyarrow]
date_renewal                      string[pyarrow]
forecast_cons_12m                         float64
forecast_cons_year                          int64
forecast_discount_energy                  float64
forecast_meter_rent_12m                   float64
forecast_price_energy_off_peak            float64
forecast_price_energy_peak                float64
forecast_price_pow_off_peak               float64
has_gas                           string[pyarrow]
imp_cons                                  float64
margin_gross_pow_ele                      float64
margin_net_pow_ele                        float64
nb_prod_act                                 int64
net_margin                                float64
num_years_antig                             int64
origin_up                         string[pyarrow]
pow_max                                   float64
churn                                       int64
dtype: object
We have a number of categorical string features which we can use to investigate why clients are churning
# show all columns of the dataframe
with option_context('display.max_columns', None):
    display(client.head(5))
id | channel_sales | cons_12m | cons_gas_12m | cons_last_month | date_activ | date_end | date_modif_prod | date_renewal | forecast_cons_12m | forecast_cons_year | forecast_discount_energy | forecast_meter_rent_12m | forecast_price_energy_off_peak | forecast_price_energy_peak | forecast_price_pow_off_peak | has_gas | imp_cons | margin_gross_pow_ele | margin_net_pow_ele | nb_prod_act | net_margin | num_years_antig | origin_up | pow_max | churn | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 24011ae4ebbe3035111d65fa7c15bc57 | foosdfpfkusacimwkcsosbicdxkicaua | 0 | 54946 | 0 | 2013-06-15 | 2016-06-15 | 2015-11-01 | 2015-06-23 | 0.00 | 0 | 0.0 | 1.78 | 0.114481 | 0.098142 | 40.606701 | t | 0.00 | 25.44 | 25.44 | 2 | 678.99 | 3 | lxidpiddsbxsbosboudacockeimpuepw | 43.648 | 1 |
1 | d29c2c54acc38ff3c0614d0a653813dd | MISSING | 4660 | 0 | 0 | 2009-08-21 | 2016-08-30 | 2009-08-21 | 2015-08-31 | 189.95 | 0 | 0.0 | 16.27 | 0.145711 | 0.000000 | 44.311378 | f | 0.00 | 16.38 | 16.38 | 1 | 18.89 | 6 | kamkkxfxxuwbdslkwifmmcsiusiuosws | 13.800 | 0 |
2 | 764c75f661154dac3a6c254cd082ea7d | foosdfpfkusacimwkcsosbicdxkicaua | 544 | 0 | 0 | 2010-04-16 | 2016-04-16 | 2010-04-16 | 2015-04-17 | 47.96 | 0 | 0.0 | 38.72 | 0.165794 | 0.087899 | 44.311378 | f | 0.00 | 28.60 | 28.60 | 1 | 6.60 | 6 | kamkkxfxxuwbdslkwifmmcsiusiuosws | 13.856 | 0 |
3 | bba03439a292a1e166f80264c16191cb | lmkebamcaaclubfxadlmueccxoimlema | 1584 | 0 | 0 | 2010-03-30 | 2016-03-30 | 2010-03-30 | 2015-03-31 | 240.04 | 0 | 0.0 | 19.83 | 0.146694 | 0.000000 | 44.311378 | f | 0.00 | 30.22 | 30.22 | 1 | 25.46 | 6 | kamkkxfxxuwbdslkwifmmcsiusiuosws | 13.200 | 0 |
4 | 149d57cf92fc41cf94415803a877cb4b | MISSING | 4425 | 0 | 526 | 2010-01-13 | 2016-03-07 | 2010-01-13 | 2015-03-09 | 445.75 | 526 | 0.0 | 131.73 | 0.116900 | 0.100015 | 40.606701 | f | 52.32 | 44.91 | 44.91 | 1 | 47.98 | 6 | kamkkxfxxuwbdslkwifmmcsiusiuosws | 19.800 | 0 |
# number of rows
client['id'].count().compute()
14606
client[client['id'] == '038af19179925da21a25619c5a24b745'].compute()
id | channel_sales | cons_12m | cons_gas_12m | cons_last_month | date_activ | date_end | date_modif_prod | date_renewal | forecast_cons_12m | ... | has_gas | imp_cons | margin_gross_pow_ele | margin_net_pow_ele | nb_prod_act | net_margin | num_years_antig | origin_up | pow_max | churn | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
11047 | 038af19179925da21a25619c5a24b745 | foosdfpfkusacimwkcsosbicdxkicaua | 3576 | 0 | 630 | 2013-06-03 | 2016-06-03 | 2015-04-29 | 2015-06-14 | 531.5 | ... | f | 91.82 | 21.52 | 21.52 | 1 | 52.53 | 3 | ldkssxwpmemidmecebumciepifcamkci | 13.2 | 0 |
1 rows × 26 columns
2.2 | Electricity Prices Data¶
The dataset contains, for each of PowerCo's clients, the price variation for each month
- id: client company identifier
- price_date: reference date
- price_off_peak_var: price of energy for the 1st period (off peak)
- price_peak_var: price of energy for the 2nd period (peak)
- price_mid_peak_var: price of energy for the 3rd period (mid peak)
- price_off_peak_fix: price of power for the 1st period (off peak)
- price_peak_fix: price of power for the 2nd period (peak)
- price_mid_peak_fix: price of power for the 3rd period (mid peak)
This data will allow us to check the hypothesis that was asked of us; we'll do that in a later section
prices = dd.read_csv(prices_path,sep=',',blocksize="30M")
prices.head()
id | price_date | price_off_peak_var | price_peak_var | price_mid_peak_var | price_off_peak_fix | price_peak_fix | price_mid_peak_fix | |
---|---|---|---|---|---|---|---|---|
0 | 038af19179925da21a25619c5a24b745 | 2015-01-01 | 0.151367 | 0.0 | 0.0 | 44.266931 | 0.0 | 0.0 |
1 | 038af19179925da21a25619c5a24b745 | 2015-02-01 | 0.151367 | 0.0 | 0.0 | 44.266931 | 0.0 | 0.0 |
2 | 038af19179925da21a25619c5a24b745 | 2015-03-01 | 0.151367 | 0.0 | 0.0 | 44.266931 | 0.0 | 0.0 |
3 | 038af19179925da21a25619c5a24b745 | 2015-04-01 | 0.149626 | 0.0 | 0.0 | 44.266931 | 0.0 | 0.0 |
4 | 038af19179925da21a25619c5a24b745 | 2015-05-01 | 0.149626 | 0.0 | 0.0 | 44.266931 | 0.0 | 0.0 |
prices['id'].count().compute()
193002
Looks like we have monthly pricing data covering the year 2015; 193,002 rows works out to roughly 12 monthly observations per company
# prices for each customer by month
prices[prices['id'] == '038af19179925da21a25619c5a24b745'].compute()
id | price_date | price_off_peak_var | price_peak_var | price_mid_peak_var | price_off_peak_fix | price_peak_fix | price_mid_peak_fix | |
---|---|---|---|---|---|---|---|---|
0 | 038af19179925da21a25619c5a24b745 | 2015-01-01 | 0.151367 | 0.0 | 0.0 | 44.266931 | 0.0 | 0.0 |
1 | 038af19179925da21a25619c5a24b745 | 2015-02-01 | 0.151367 | 0.0 | 0.0 | 44.266931 | 0.0 | 0.0 |
2 | 038af19179925da21a25619c5a24b745 | 2015-03-01 | 0.151367 | 0.0 | 0.0 | 44.266931 | 0.0 | 0.0 |
3 | 038af19179925da21a25619c5a24b745 | 2015-04-01 | 0.149626 | 0.0 | 0.0 | 44.266931 | 0.0 | 0.0 |
4 | 038af19179925da21a25619c5a24b745 | 2015-05-01 | 0.149626 | 0.0 | 0.0 | 44.266931 | 0.0 | 0.0 |
5 | 038af19179925da21a25619c5a24b745 | 2015-06-01 | 0.149626 | 0.0 | 0.0 | 44.266930 | 0.0 | 0.0 |
6 | 038af19179925da21a25619c5a24b745 | 2015-07-01 | 0.150321 | 0.0 | 0.0 | 44.444710 | 0.0 | 0.0 |
7 | 038af19179925da21a25619c5a24b745 | 2015-08-01 | 0.145859 | 0.0 | 0.0 | 44.444710 | 0.0 | 0.0 |
8 | 038af19179925da21a25619c5a24b745 | 2015-09-01 | 0.145859 | 0.0 | 0.0 | 44.444710 | 0.0 | 0.0 |
9 | 038af19179925da21a25619c5a24b745 | 2015-10-01 | 0.145859 | 0.0 | 0.0 | 44.444710 | 0.0 | 0.0 |
10 | 038af19179925da21a25619c5a24b745 | 2015-11-01 | 0.145859 | 0.0 | 0.0 | 44.444710 | 0.0 | 0.0 |
11 | 038af19179925da21a25619c5a24b745 | 2015-12-01 | 0.145859 | 0.0 | 0.0 | 44.444710 | 0.0 | 0.0 |
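Since the later price-sensitivity features hinge on monthly movements, one quick sanity check worth running is that each company really contributes one row per 2015 month. A minimal pandas sketch on a toy frame (the real check would group `price_data` by `id` in the same way):

```python
import pandas as pd

# toy stand-in for the prices table: two companies, one row per 2015 month
toy = pd.DataFrame({
    'id': ['a'] * 12 + ['b'] * 12,
    'price_date': list(pd.date_range('2015-01-01', periods=12, freq='MS')) * 2,
})

# each company should contribute exactly 12 distinct monthly dates
months_per_id = toy.groupby('id')['price_date'].nunique()
print((months_per_id == 12).all())
```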
3. Descriptive Statistics¶
Let's look at some basic univariate statistics and some more in-depth relations we can find in the data
3.1 | What we'll do¶
- We first want to do some surface-level statistical investigation
- Ultimately, we'll need to answer questions regarding what influences customer churn
- First, let's check the percentage of clients who have churned over the next 3 months
(client['churn'].value_counts(normalize=True)*100).compute()
churn
0    90.284814
1     9.715186
Name: proportion, dtype: float64
3.2 | Customer data¶
client.describe().compute().round(3)
cons_12m | cons_gas_12m | cons_last_month | forecast_cons_12m | forecast_cons_year | forecast_discount_energy | forecast_meter_rent_12m | forecast_price_energy_off_peak | forecast_price_energy_peak | forecast_price_pow_off_peak | imp_cons | margin_gross_pow_ele | margin_net_pow_ele | nb_prod_act | net_margin | num_years_antig | pow_max | churn | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 14606.000 | 14606.000 | 14606.000 | 14606.000 | 14606.000 | 14606.000 | 14606.000 | 14606.000 | 14606.000 | 14606.000 | 14606.000 | 14606.000 | 14606.000 | 14606.000 | 14606.000 | 14606.000 | 14606.000 | 14606.000 |
mean | 159220.286 | 28092.375 | 16090.270 | 1868.615 | 1399.763 | 0.967 | 63.087 | 0.137 | 0.050 | 43.130 | 152.787 | 24.565 | 24.563 | 1.292 | 189.265 | 4.998 | 18.135 | 0.097 |
std | 573465.264 | 162973.059 | 64364.196 | 2387.572 | 3247.786 | 5.108 | 66.166 | 0.025 | 0.049 | 4.486 | 341.369 | 20.231 | 20.230 | 0.710 | 311.798 | 1.612 | 13.535 | 0.296 |
min | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 1.000 | 3.300 | 0.000 |
25% | 5674.750 | 0.000 | 0.000 | 494.995 | 0.000 | 0.000 | 16.180 | 0.116 | 0.000 | 40.607 | 0.000 | 14.280 | 14.280 | 1.000 | 50.712 | 4.000 | 12.500 | 0.000 |
50% | 14115.500 | 0.000 | 792.500 | 1112.875 | 314.000 | 0.000 | 18.795 | 0.143 | 0.084 | 44.311 | 37.395 | 21.640 | 21.640 | 1.000 | 112.530 | 5.000 | 13.856 | 0.000 |
75% | 40763.750 | 0.000 | 3383.000 | 2401.790 | 1745.750 | 0.000 | 131.030 | 0.146 | 0.099 | 44.311 | 193.980 | 29.880 | 29.880 | 1.000 | 243.098 | 6.000 | 19.172 | 0.000 |
max | 6207104.000 | 4154590.000 | 771203.000 | 82902.830 | 175375.000 | 30.000 | 599.310 | 0.274 | 0.196 | 59.266 | 15042.790 | 374.640 | 374.640 | 32.000 | 24570.650 | 13.000 | 320.000 | 1.000 |
- The proportion of churned clients is quite low: 9.72%
- Nevertheless, our client has raised concern, so we need to investigate what affects this churn
(client['churn'].value_counts(normalize=True).compute() * 100).round(2)
churn
0    90.28
1     9.72
Name: proportion, dtype: float64
3.3 | Price Data¶
Let's look at the statistics for each of the pricing columns across all customers
prices.describe().compute().round(3).T
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
price_off_peak_var | 193002.0 | 0.141 | 0.025 | 0.0 | 0.126 | 0.146 | 0.152 | 0.281 |
price_peak_var | 193002.0 | 0.055 | 0.050 | 0.0 | 0.000 | 0.085 | 0.102 | 0.230 |
price_mid_peak_var | 193002.0 | 0.030 | 0.036 | 0.0 | 0.000 | 0.000 | 0.073 | 0.114 |
price_off_peak_fix | 193002.0 | 43.334 | 5.410 | 0.0 | 40.729 | 44.267 | 44.445 | 59.445 |
price_peak_fix | 193002.0 | 10.623 | 12.842 | 0.0 | 0.000 | 0.000 | 24.340 | 36.491 |
price_mid_peak_fix | 193002.0 | 6.410 | 7.774 | 0.0 | 0.000 | 0.000 | 16.226 | 17.458 |
4. What influences churn¶
Let's investigate which factors affect churn, and relay them via statistics and visualisations
- We'll limit our investigation to customer features only
4.1 | Sales Channel¶
- Sales channel doesn't appear to have a simple linear relationship with churn
- Some channels account for more churn than others; notably, the group foosdfpfkusacimwkcsosbicdxkicaua makes up over half of the churns
churned = client[client['churn'] == 1]
notchurn = client[client['churn'] == 0]
Let's look at the counts relative to the total number of churned clients
churn_counts = churned.groupby(['channel_sales']).agg(counts=('id','count')).compute()
churn_counts['count%'] = ((churn_counts['counts']/churn_counts['counts'].sum()) * 100).round(2)
churn_counts = churn_counts.reset_index()
churn_counts.sort_values(by='count%',ascending=False)
channel_sales | counts | count% | |
---|---|---|---|
0 | foosdfpfkusacimwkcsosbicdxkicaua | 820 | 57.79 |
2 | MISSING | 283 | 19.94 |
1 | usilxuppasemubllopkaafesmlibmsdf | 138 | 9.73 |
3 | lmkebamcaaclubfxadlmueccxoimlema | 103 | 7.26 |
4 | ewpakwlliwisiwduibdlfmalxowmwpci | 75 | 5.29 |
- Now let's look at churn relative to each group in channel_sales
- Let's also create a helper function which lets us compare, percentage-wise, churned to non-churned clients within each group
def compare_percentage(column:str, df:pd.DataFrame=client):
    # count churned / not-churned clients per category of `column`
    churn_counts = df.groupby([column,'churn']).agg(counts=('id','count')).compute()
    churn_counts = churn_counts.reset_index()
    # total clients per category, used as the denominator
    all_counts = client.groupby([column]).agg(counts=('id','count')).compute()
    merged = churn_counts.merge(all_counts, left_on=column, right_index=True)
    merged['percent%'] = ((merged['counts_x']/merged['counts_y'])*100).round(2)
    merged = merged.groupby([column,'churn'])['percent%'].mean().to_frame().sort_index()
    return merged
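As a cross-check, the same per-group percentages can be produced in one line with `pd.crosstab` using `normalize='index'`; a toy sketch with hypothetical channel labels:

```python
import pandas as pd

# toy client table (channels 'A'/'B' are hypothetical): churn differs by channel
toy = pd.DataFrame({
    'channel_sales': ['A', 'A', 'A', 'A', 'B', 'B'],
    'churn':         [ 0,   0,   0,   1,   0,   0 ],
})

# row-normalised crosstab: share of churned vs retained within each channel
pct = (pd.crosstab(toy['channel_sales'], toy['churn'], normalize='index') * 100).round(2)
```

Here `pct.loc['A', 1]` gives the churn percentage within channel A, matching what `compare_percentage` computes per group.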
compare_percentage('channel_sales')
percent% | ||
---|---|---|
channel_sales | churn | |
MISSING | 0 | 92.40 |
1 | 7.60 | |
epumfxlbckeskwekxbiuasklxalciiuu | 0 | 100.00 |
ewpakwlliwisiwduibdlfmalxowmwpci | 0 | 91.60 |
1 | 8.40 | |
fixdbufsefwooaasfcxdxadsiekoceaa | 0 | 100.00 |
foosdfpfkusacimwkcsosbicdxkicaua | 0 | 87.86 |
1 | 12.14 | |
lmkebamcaaclubfxadlmueccxoimlema | 0 | 94.41 |
1 | 5.59 | |
sddiedcslfslkckwlfkdpoeeailfpeds | 0 | 100.00 |
usilxuppasemubllopkaafesmlibmsdf | 0 | 89.96 |
1 | 10.04 |
4.2 | Gas Active Clients¶
- One other feature we should look at specifies whether the client is also a gas client
- If clients are also gas customers, the likelihood of them churning is substantially smaller
- This would definitely be a point to raise when trying to prevent current users from switching to other companies
gas_counts = churned.groupby(['has_gas']).agg(counts=('id','count')).compute()
gas_counts['count%'] = ((gas_counts['counts']/gas_counts['counts'].sum()) * 100).round(2)
gas_counts.sort_values(by='count%',ascending=False)
counts | count% | |
---|---|---|
has_gas | ||
f | 1202 | 84.71 |
t | 217 | 15.29 |
compare_percentage('has_gas')
percent% | ||
---|---|---|
has_gas | churn | |
f | 0 | 89.95 |
1 | 10.05 | |
t | 0 | 91.81 |
1 | 8.19 |
4.3 | Correlation to Churn¶
- We can also compare how the number of existing customer products, amongst other numeric columns, correlates with churn
- Logically, the higher the number of products a customer has, the lower the chances of them churning
import warnings; warnings.filterwarnings('ignore')
client.select_dtypes(include=['float64','int64']).corr().round(2).compute().loc['churn'].to_frame()
churn | |
---|---|
cons_12m | -0.05 |
cons_gas_12m | -0.04 |
cons_last_month | -0.05 |
forecast_cons_12m | 0.01 |
forecast_cons_year | -0.00 |
forecast_discount_energy | 0.02 |
forecast_meter_rent_12m | 0.04 |
forecast_price_energy_off_peak | -0.01 |
forecast_price_energy_peak | 0.03 |
forecast_price_pow_off_peak | 0.01 |
imp_cons | -0.00 |
margin_gross_pow_ele | 0.10 |
margin_net_pow_ele | 0.10 |
nb_prod_act | -0.01 |
net_margin | 0.04 |
num_years_antig | -0.07 |
pow_max | 0.03 |
churn | 1.00 |
- From the results we can see that most numeric features are only very mildly linearly correlated with churn
- This metric doesn't give us enough detail about the effect of the number of products
4.4 | Number of Products¶
Let's check in more detail whether there is indeed any relation to churn
- We can see that beyond 5 products, the churn percentage is 0
- At 5 products or fewer, the churn rate is roughly the same (around 10%)
client['nb_prod_act_cat'] = client['nb_prod_act'].astype('category')
compare_percentage('nb_prod_act_cat')
percent% | ||
---|---|---|
nb_prod_act_cat | churn | |
1 | 0 | 90.02 |
1 | 9.98 | |
2 | 0 | 91.49 |
1 | 8.51 | |
3 | 0 | 90.06 |
1 | 9.94 | |
4 | 0 | 90.00 |
1 | 10.00 | |
5 | 0 | 90.32 |
1 | 9.68 | |
6 | 0 | 100.00 |
1 | 0.00 | |
8 | 0 | 100.00 |
1 | 0.00 | |
9 | 0 | 100.00 |
1 | 0.00 | |
10 | 0 | 100.00 |
1 | 0.00 | |
32 | 0 | 100.00 |
1 | 0.00 |
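Since churn is a 0/1 indicator, the churn rate per product count can also be read off directly as a group mean; a toy sketch with hypothetical values:

```python
import pandas as pd

# toy data: product counts and a 0/1 churn flag (values hypothetical)
toy = pd.DataFrame({
    'nb_prod_act': [1, 1, 1, 1, 2, 2, 6, 6],
    'churn':       [0, 1, 0, 1, 0, 1, 0, 0],
})

# the mean of a 0/1 indicator per group is exactly the churn rate
churn_rate = (toy.groupby('nb_prod_act')['churn'].mean() * 100).round(2)
```

This is a compact alternative to tabulating both churn classes when only the churned percentage is of interest.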
5. Hypothesis Investigation¶
We looked at various factors from the customer dataset that can affect churn. Let's now examine the pricing for each customer and check the customers' sensitivity to pricing.
5.1 | Standard Deviations¶
Since we have the pricing data for each of the companies for the year 2015, we will create new features to measure price sensitivity
prices.head()
id | price_date | price_off_peak_var | price_peak_var | price_mid_peak_var | price_off_peak_fix | price_peak_fix | price_mid_peak_fix | |
---|---|---|---|---|---|---|---|---|
0 | 038af19179925da21a25619c5a24b745 | 2015-01-01 | 0.151367 | 0.0 | 0.0 | 44.266931 | 0.0 | 0.0 |
1 | 038af19179925da21a25619c5a24b745 | 2015-02-01 | 0.151367 | 0.0 | 0.0 | 44.266931 | 0.0 | 0.0 |
2 | 038af19179925da21a25619c5a24b745 | 2015-03-01 | 0.151367 | 0.0 | 0.0 | 44.266931 | 0.0 | 0.0 |
3 | 038af19179925da21a25619c5a24b745 | 2015-04-01 | 0.149626 | 0.0 | 0.0 | 44.266931 | 0.0 | 0.0 |
4 | 038af19179925da21a25619c5a24b745 | 2015-05-01 | 0.149626 | 0.0 | 0.0 | 44.266931 | 0.0 | 0.0 |
Let's aggregate the standard deviation of each price column per customer; the standard deviation is an indicator of price volatility
cols = ['price_off_peak_var','price_peak_var','price_mid_peak_var',
'price_off_peak_fix','price_peak_fix','price_mid_peak_fix']
cols_diff = ['diff_price_off_peak_var','diff_price_peak_var','diff_price_mid_peak_var',
'diff_price_off_peak_fix','diff_price_peak_fix','diff_price_mid_peak_fix']
cols_diff_cumsum = ['cdiff_price_off_peak_var','cdiff_price_peak_var','cdiff_price_mid_peak_var',
'cdiff_price_off_peak_fix','cdiff_price_peak_fix','cdiff_price_mid_peak_fix']
std_prices = prices.groupby('id')[cols].std().compute()
client_churn = client[['id','churn']].copy()
client_churn_stats = client_churn.merge(std_prices,left_on='id',right_index=True).compute()
client_churn_stats.index = client_churn_stats['id']
client_churn_stats = client_churn_stats.drop(['id'],axis=1)
client_churn_stats.corr().round(2).loc['churn'].to_frame()
churn | |
---|---|
churn | 1.00 |
price_off_peak_var | 0.04 |
price_peak_var | 0.02 |
price_mid_peak_var | 0.02 |
price_off_peak_fix | 0.02 |
price_peak_fix | 0.02 |
price_mid_peak_fix | 0.01 |
5.2 | Price Hike Counters¶
We can also take the difference between consecutive months' values and count how many times the prices went up, for each metric and for each of PowerCo's customers
# note: .diff() runs over the full column, so each customer's first
# difference is taken against the previous customer's last row
for col_diff,col in zip(cols_diff,cols):
    prices[col_diff] = prices[col].diff()
# count the number of times there have been price hikes
diff_price_off_peak_var_counter = prices.groupby('id').apply(lambda x: (x['diff_price_off_peak_var'] > 0).sum()).compute()
diff_price_peak_var_counter = prices.groupby('id').apply(lambda x: (x['diff_price_peak_var'] > 0).sum()).compute()
diff_price_mid_peak_var_counter = prices.groupby('id').apply(lambda x: (x['diff_price_mid_peak_var'] > 0).sum()).compute()
diff_price_off_peak_fix_counter = prices.groupby('id').apply(lambda x: (x['diff_price_off_peak_fix'] > 0).sum()).compute()
diff_price_peak_fix_counter = prices.groupby('id').apply(lambda x: (x['diff_price_peak_fix'] > 0).sum()).compute()
diff_price_mid_peak_fix_counter = prices.groupby('id').apply(lambda x: (x['diff_price_mid_peak_fix'] > 0).sum()).compute()
diff_price_mid_peak_fix_counter
id
0002203ffbb812588b632b9e628cc38d    2
0004351ebdd665e6ee664792efc4fd13    0
0010bcc39e42b3c2131ed2ce55246e3c    0
0010ee3855fdea87602a5b7aba8e42de    2
00114d74e963e47177db89bc70108537    0
                                   ..
ffef185810e44254c3a4c6395e6b4d8a    2
fffac626da707b1b5ab11e8431a4d0a2    0
fffc0cacd305dd51f316424bbb08d1bd    2
fffe4f5646aa39c7f97f95ae2679ce64    2
ffff7fa066f1fb305ae285bb03bf325a    2
Length: 16096, dtype: int64
price_hikes = pd.concat([diff_price_off_peak_var_counter,diff_price_peak_var_counter,diff_price_mid_peak_var_counter,
diff_price_off_peak_fix_counter,diff_price_peak_fix_counter,diff_price_mid_peak_fix_counter],axis=1)
price_hikes.columns = cols_diff
price_hikes
diff_price_off_peak_var | diff_price_peak_var | diff_price_mid_peak_var | diff_price_off_peak_fix | diff_price_peak_fix | diff_price_mid_peak_fix | |
---|---|---|---|---|---|---|
id | ||||||
0002203ffbb812588b632b9e628cc38d | 1 | 2 | 2 | 1 | 2 | 2 |
0004351ebdd665e6ee664792efc4fd13 | 2 | 0 | 0 | 2 | 0 | 0 |
0010bcc39e42b3c2131ed2ce55246e3c | 3 | 0 | 0 | 1 | 0 | 0 |
0010ee3855fdea87602a5b7aba8e42de | 0 | 1 | 2 | 2 | 2 | 2 |
00114d74e963e47177db89bc70108537 | 2 | 0 | 0 | 1 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... |
ffef185810e44254c3a4c6395e6b4d8a | 2 | 2 | 2 | 1 | 2 | 2 |
fffac626da707b1b5ab11e8431a4d0a2 | 2 | 0 | 0 | 1 | 0 | 0 |
fffc0cacd305dd51f316424bbb08d1bd | 3 | 2 | 2 | 1 | 2 | 2 |
fffe4f5646aa39c7f97f95ae2679ce64 | 0 | 1 | 2 | 2 | 2 | 2 |
ffff7fa066f1fb305ae285bb03bf325a | 0 | 1 | 2 | 2 | 2 | 2 |
16096 rows × 6 columns
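One caveat with the counters above: `.diff()` was applied over the full column, so each customer's first row is differenced against the previous customer's last row. A per-group diff avoids that contamination; a toy pandas sketch (prices hypothetical):

```python
import pandas as pd

# toy prices: two customers, three monthly observations each
toy = pd.DataFrame({
    'id':    ['a', 'a', 'a', 'b', 'b', 'b'],
    'price': [0.10, 0.12, 0.12, 0.04, 0.05, 0.06],
})

# diff within each id, then count strictly positive month-on-month moves
hikes = (
    toy.groupby('id')['price']
       .diff()                 # NaN at each customer's first row, never a cross-customer diff
       .gt(0)
       .groupby(toy['id'])
       .sum()
)
```

Customer `a` has one hike (0.10 → 0.12) and customer `b` has two; with a plain column-wide `.diff()`, `b`'s first row would instead be compared to `a`'s last price.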
client_churn = client[['id','churn']].copy()
client_churn_stats = client_churn.merge(price_hikes,left_on='id',right_index=True).compute()
client_churn_stats['sum'] = client_churn_stats[cols_diff].sum(axis=1)  # total number of hikes across all six price columns
client_churn_stats.index = client_churn_stats['id']
client_churn_stats = client_churn_stats.drop(['id'],axis=1)
client_churn_stats.corr().round(2).loc['churn'].to_frame()
churn | |
---|---|
churn | 1.00 |
diff_price_off_peak_var | -0.01 |
diff_price_peak_var | 0.02 |
diff_price_mid_peak_var | 0.05 |
diff_price_off_peak_fix | 0.00 |
diff_price_peak_fix | 0.04 |
diff_price_mid_peak_fix | 0.04 |
sum | 0.03 |
5.3 | Price Accumulation¶
Lastly, let's check how the accumulated price changes relate to churn
for cdiff,diff in zip(cols_diff_cumsum,cols_diff):
prices[cdiff] = prices[diff].cumsum()
prices.columns
Index(['id', 'price_date', 'price_off_peak_var', 'price_peak_var',
       'price_mid_peak_var', 'price_off_peak_fix', 'price_peak_fix',
       'price_mid_peak_fix', 'diff_price_off_peak_var', 'diff_price_peak_var',
       'diff_price_mid_peak_var', 'diff_price_off_peak_fix',
       'diff_price_peak_fix', 'diff_price_mid_peak_fix',
       'cdiff_price_off_peak_var', 'cdiff_price_peak_var',
       'cdiff_price_mid_peak_var', 'cdiff_price_off_peak_fix',
       'cdiff_price_peak_fix', 'cdiff_price_mid_peak_fix'],
      dtype='object')
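Worth noting: a cumsum of a diff telescopes, so each cdiff value equals the current price minus the value in the file's very first row, not the customer's own first month. The per-customer accumulated change over the year is simply the last price minus the first; a toy sketch (prices hypothetical):

```python
import pandas as pd

# toy prices for two customers across the year
toy = pd.DataFrame({
    'id':    ['a', 'a', 'a', 'b', 'b'],
    'price': [0.10, 0.12, 0.11, 0.05, 0.07],
})

# net accumulated change per customer: last observed price minus the first
net_change = toy.groupby('id')['price'].agg(lambda s: s.iloc[-1] - s.iloc[0])
```

This assumes rows are already sorted by date within each id, as the prices file appears to be.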
Accumulated difference in price variations for each customer
pdprices = prices.compute()
client_churn = client[['id','churn']].compute().copy()
pdprices
id | price_date | price_off_peak_var | price_peak_var | price_mid_peak_var | price_off_peak_fix | price_peak_fix | price_mid_peak_fix | diff_price_off_peak_var | diff_price_peak_var | diff_price_mid_peak_var | diff_price_off_peak_fix | diff_price_peak_fix | diff_price_mid_peak_fix | cdiff_price_off_peak_var | cdiff_price_peak_var | cdiff_price_mid_peak_var | cdiff_price_off_peak_fix | cdiff_price_peak_fix | cdiff_price_mid_peak_fix | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 038af19179925da21a25619c5a24b745 | 2015-01-01 | 0.151367 | 0.000000 | 0.000000 | 44.266931 | 0.00000 | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 038af19179925da21a25619c5a24b745 | 2015-02-01 | 0.151367 | 0.000000 | 0.000000 | 44.266931 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 |
2 | 038af19179925da21a25619c5a24b745 | 2015-03-01 | 0.151367 | 0.000000 | 0.000000 | 44.266931 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 |
3 | 038af19179925da21a25619c5a24b745 | 2015-04-01 | 0.149626 | 0.000000 | 0.000000 | 44.266931 | 0.00000 | 0.000000 | -0.001741 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | -0.001741 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 |
4 | 038af19179925da21a25619c5a24b745 | 2015-05-01 | 0.149626 | 0.000000 | 0.000000 | 44.266931 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | -0.001741 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
192997 | 16f51cdc2baa19af0b940ee1b3dd17d5 | 2015-08-01 | 0.119916 | 0.102232 | 0.076257 | 40.728885 | 24.43733 | 16.291555 | -0.008161 | -0.004169 | -0.000054 | 0.0 | 0.0 | 0.0 | -0.031451 | 0.102232 | 0.076257 | -3.538046 | 24.43733 | 16.291555 |
192998 | 16f51cdc2baa19af0b940ee1b3dd17d5 | 2015-09-01 | 0.119916 | 0.102232 | 0.076257 | 40.728885 | 24.43733 | 16.291555 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | -0.031451 | 0.102232 | 0.076257 | -3.538046 | 24.43733 | 16.291555 |
192999 | 16f51cdc2baa19af0b940ee1b3dd17d5 | 2015-10-01 | 0.119916 | 0.102232 | 0.076257 | 40.728885 | 24.43733 | 16.291555 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | -0.031451 | 0.102232 | 0.076257 | -3.538046 | 24.43733 | 16.291555 |
193000 | 16f51cdc2baa19af0b940ee1b3dd17d5 | 2015-11-01 | 0.119916 | 0.102232 | 0.076257 | 40.728885 | 24.43733 | 16.291555 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | -0.031451 | 0.102232 | 0.076257 | -3.538046 | 24.43733 | 16.291555 |
193001 | 16f51cdc2baa19af0b940ee1b3dd17d5 | 2015-12-01 | 0.119916 | 0.102232 | 0.076257 | 40.728885 | 24.43733 | 16.291555 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | -0.031451 | 0.102232 | 0.076257 | -3.538046 | 24.43733 | 16.291555 |
193002 rows × 20 columns
diff_price_off_peak_var_cumsum = pdprices.groupby('id',as_index=True)[['cdiff_price_off_peak_var']].last()
diff_price_peak_var_cumsum = pdprices.groupby('id',as_index=True)[['cdiff_price_peak_var']].last()
diff_price_mid_peak_var_cumsum = pdprices.groupby('id',as_index=True)[['cdiff_price_mid_peak_var']].last()
diff_price_off_peak_fix_cumsum = pdprices.groupby('id',as_index=True)[['cdiff_price_off_peak_fix']].last()
diff_price_peak_fix_cumsum = pdprices.groupby('id',as_index=True)[['cdiff_price_peak_fix']].last()
diff_price_mid_peak_fix_cumsum = pdprices.groupby('id',as_index=True)[['cdiff_price_mid_peak_fix']].last()
price_accumulations = pd.concat([diff_price_off_peak_var_cumsum,diff_price_peak_var_cumsum,diff_price_mid_peak_var_cumsum,
diff_price_off_peak_fix_cumsum,diff_price_peak_fix_cumsum,diff_price_mid_peak_fix_cumsum],axis=1)
price_accumulations
# price_accumulations.corr().round(2).loc['churn'].to_frame()
cdiff_price_off_peak_var | cdiff_price_peak_var | cdiff_price_mid_peak_var | cdiff_price_off_peak_fix | cdiff_price_peak_fix | cdiff_price_mid_peak_fix | |
---|---|---|---|---|---|---|
id | ||||||
0002203ffbb812588b632b9e628cc38d | -0.031461 | 0.101673 | 0.073719 | -3.538046 | 24.43733 | 16.291555 |
0004351ebdd665e6ee664792efc4fd13 | -0.007424 | 0.000000 | 0.000000 | 0.177779 | 0.00000 | 0.000000 |
0010bcc39e42b3c2131ed2ce55246e3c | 0.049913 | 0.000000 | 0.000000 | 1.677779 | 0.00000 | 0.000000 |
0010ee3855fdea87602a5b7aba8e42de | -0.038299 | 0.095385 | 0.069409 | -3.538046 | 24.43733 | 16.291555 |
00114d74e963e47177db89bc70108537 | -0.005927 | 0.000000 | 0.000000 | -0.000001 | 0.00000 | 0.000000 |
... | ... | ... | ... | ... | ... | ... |
ffef185810e44254c3a4c6395e6b4d8a | -0.038879 | 0.094804 | 0.068829 | -3.538046 | 24.43733 | 16.291555 |
fffac626da707b1b5ab11e8431a4d0a2 | -0.006320 | 0.000000 | 0.000000 | 0.177779 | 0.00000 | 0.000000 |
fffc0cacd305dd51f316424bbb08d1bd | 0.000032 | 0.126871 | 0.091394 | -3.038046 | 24.93733 | 16.791555 |
fffe4f5646aa39c7f97f95ae2679ce64 | -0.033192 | 0.100491 | 0.074516 | -3.538046 | 24.43733 | 16.291555 |
ffff7fa066f1fb305ae285bb03bf325a | -0.031451 | 0.102232 | 0.076257 | -3.538046 | 24.43733 | 16.291555 |
16096 rows × 6 columns
client_churn_stats = client_churn.merge(price_accumulations,left_on='id',right_index=True)
client_churn_stats = client_churn_stats.drop(['id'],axis=1)
client_churn_stats
churn | cdiff_price_off_peak_var | cdiff_price_peak_var | cdiff_price_mid_peak_var | cdiff_price_off_peak_fix | cdiff_price_peak_fix | cdiff_price_mid_peak_fix | |
---|---|---|---|---|---|---|---|
0 | 1 | -0.005334 | 0.085483 | 0.000000 | -0.000001 | 0.000000 | 0.000000 |
1 | 0 | -0.003767 | 0.000000 | 0.000000 | 0.177779 | 0.000000 | 0.000000 |
2 | 0 | 0.016431 | 0.088409 | 0.000000 | 0.177779 | 0.000000 | 0.000000 |
3 | 0 | -0.002781 | 0.000000 | 0.000000 | 0.177779 | 0.000000 | 0.000000 |
4 | 0 | -0.031461 | 0.101673 | 0.073719 | -3.538046 | 24.437330 | 16.291555 |
... | ... | ... | ... | ... | ... | ... | ... |
14601 | 0 | -0.011195 | 0.000000 | 0.000000 | 0.177779 | 0.000000 | 0.000000 |
14602 | 1 | -0.049015 | 0.093181 | 0.070990 | 14.906537 | 36.490689 | 8.367731 |
14603 | 1 | -0.031461 | 0.101673 | 0.073719 | -3.538046 | 24.437330 | 16.291555 |
14604 | 0 | -0.003767 | 0.000000 | 0.000000 | 0.177779 | 0.000000 | 0.000000 |
14605 | 0 | 0.014595 | 0.086905 | 0.000000 | -0.000001 | 0.000000 | 0.000000 |
14606 rows × 7 columns
client_churn_stats.corr().round(2).loc['churn'].to_frame()
churn | |
---|---|
churn | 1.00 |
cdiff_price_off_peak_var | -0.01 |
cdiff_price_peak_var | 0.03 |
cdiff_price_mid_peak_var | 0.05 |
cdiff_price_off_peak_fix | 0.02 |
cdiff_price_peak_fix | 0.05 |
cdiff_price_mid_peak_fix | 0.04 |
6. Takeaways¶
Having been introduced to our new client in the first part:
- Coming out of the first meeting, we need to test a hypothesis raised during it: "how much is price a factor in a customer's choice?"
- We took a first look at the data presented to us by our client PowerCo, exploring both the data about PowerCo's customers and their prices for the entire year 2015
- The client has expressed concern about customers leaving; we found that only 9.7% of customers have churned. A small portion, but enough for the client to raise concern
Churn Investigation¶
Feature relations
- The sales channel (channel_sales) showed that about 58% of churned users came from the subgroup foosdfpfkusacimwkcsosbicdxkicaua; within that subgroup itself, roughly 12% of users churned
- 85% of the users who churned didn't have a gas contract (has_gas) with the same supplier
- num_years_antig shows that churn tends to drop the longer a customer has been a client; likewise, the gross and net margins on power subscription (margin_gross_pow_ele, margin_net_pow_ele) showed some of the strongest linear correlations with churn (0.10), i.e. the higher these margins, the more likely a churn outcome
- Overall, the linear correlations of the features with churn tended to be quite weak
- The number of products (nb_prod_act) held with our client PowerCo was more insightful: customers with more than 5 products showed no churn at all
Hypothesis check for pricing
To check the hypothesis that was raised, we derived a few extra features from the pricing data:
- The standard deviations of the 2015 prices showed only very small positive correlations with churn, weakly suggesting that less stable pricing comes with a slightly higher tendency to churn
- Price hikes, for which we counted the number of times prices increased over the year, likewise showed small positive correlations: customers tended to churn slightly more often the more their prices were raised
- Lastly, we checked the accumulated price change over the entire year; the correlations didn't change much either, peaking at a linear correlation of 0.05
These factors suggest that pricing and its volatility aren't the most critical drivers of a customer's decision to churn