Feature Engineering and Modelling

1. Background

Estelle conducted a further review and has provided you with a final dataset to use for this task, named “data_for_predictions.csv”.

The outputs of your work will be shared with the AD, and Estelle has given you a few points to include within the notebook:

Why did you choose the evaluation metrics that you used? Please elaborate on your choices.
Do you think that the model performance is satisfactory? Give justification for your answer.
Make sure that your work is presented clearly, with comments and explanations.
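As a reference point for the question on evaluation metrics, the following is a minimal sketch of how accuracy, precision, recall and F1 can be derived from the confusion-matrix counts of a binary churn target. The labels are made up for illustration and are not taken from the client dataset, and the helper name `binary_metrics` is hypothetical.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for 0/1 labels (1 = churn)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # churners caught
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # stayers left alone
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # churners missed

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Illustrative labels only: 10 customers, 2 of whom churned.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred_all_stay = [0] * 10                         # naive "nobody churns" rule
y_pred_model = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]      # hypothetical model output

baseline = binary_metrics(y_true, y_pred_all_stay)
model = binary_metrics(y_true, y_pred_model)
print("baseline:", baseline)
print("model:   ", model)
```

In this toy example both sets of predictions reach the same accuracy, yet only the second one identifies any churners. If churners turn out to be a minority in the real data, this is one reason to report precision and recall alongside accuracy.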

Binary Classifier

From our business requirements, we know that we need to understand why our client's customers are churning. We were given data with a binary outcome for each customer, which indicates whether that customer churned within three months (churned, target = 1; stayed, target = 0).

We also noted that the features we tried to engineer all have a very weak linear correlation with the target variable. This suggests that, to understand why customers churn, we need a model that can capture the non-linear relationships in the dataset.
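To illustrate why a weak linear correlation does not rule out a strong relationship, here is a small sketch on made-up values (not the client data). The feature `x` and the rule generating `y` are assumptions chosen so that the target depends only on how extreme the feature is, not on its sign.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Hypothetical feature on a symmetric range, -5 to 5.
x = list(range(-5, 6))
# Hypothetical target: 1 ("churn") when the feature is extreme in either direction.
y = [1 if abs(v) >= 4 else 0 for v in x]

r = pearson(x, y)
print(f"Pearson correlation between x and y: {r:.3f}")
```

The target here is fully determined by the feature, yet the linear correlation is essentially zero because the positive and negative sides cancel out. A tree-based model can pick up this kind of pattern with two threshold splits, which is the motivation for moving beyond linear measures.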

For this reason, we decided to treat this as a binary classification problem in which our model will learn to differentiate between the two target outcomes (churned or stayed).

Descriptive Model

Our requirement for the model is that it must be able to indicate which of the features we pass into it are relevant and which are not. For this reason, we turn our attention to Random Forest, which can model non-linear relationships and provides a feature importance score for each input.
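The idea can be sketched as follows with scikit-learn's `RandomForestClassifier` and its `feature_importances_` attribute. This is an illustration on synthetic data, not the model fitted in this notebook: the feature names, the rule generating the target, and the hyperparameters are all assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): one informative feature and two pure-noise features.
rng = np.random.default_rng(0)
n_samples = 1000
X = rng.normal(size=(n_samples, 3))
feature_names = ["signal", "noise_a", "noise_b"]

# Hypothetical target: "churn" when the informative feature is extreme in
# either direction, i.e. a non-linear dependence.
y = (np.abs(X[:, 0]) > 1.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Hyperparameters are illustrative, not tuned.
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

# Impurity-based importances; they sum to 1 across features.
importances = dict(zip(feature_names, model.feature_importances_))
ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)

print(f"test accuracy: {test_accuracy:.3f}")
for name, score in ranked:
    print(f"{name:>8}: {score:.3f}")
```

Impurity-based importances are convenient but can favour features with many distinct values, so it may be worth cross-checking them with another method, such as permutation importance, before drawing conclusions for the client.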

In [18]:

import dask.dataframe as dd

df = dd.read_parquet('data_for_predictions.parquet')
df.head()

Out[18]:

	id	cons_12m	cons_gas_12m	cons_last_month	forecast_cons_12m	forecast_meter_rent_12m	forecast_price_energy_off_peak	forecast_price_energy_peak	forecast_price_pow_off_peak	...	months_modif_prod	months_renewal	channel_MISSING	channel_foosdfpfkusacimwkcsosbicdxkicaua	channel_lmkebamcaaclubfxadlmueccxoimlema	origin_up_kamkkxfxxuwbdslkwifmmcsiusiuosws	origin_up_lxidpiddsbxsbosboudacockeimpuepw
0	24011ae4ebbe3035111d65fa7c15bc57	0.000000	4.739944	0.000000	0.000000	0.444045	0.114481	0.098142	40.606701	...	2	6	0	1	0	0	1
1	d29c2c54acc38ff3c0614d0a653813dd	3.668479	0.000000	0.000000	2.280920	1.237292	0.145711	0.000000	44.311378	...	76	4	1	0	0	1	0
2	764c75f661154dac3a6c254cd082ea7d	2.736397	0.000000	0.000000	1.689841	1.599009	0.165794	0.087899	44.311378	...	68	8	0	1	0	1	0
3	bba03439a292a1e166f80264c16191cb	3.200029	0.000000	0.000000	2.382089	1.318689	0.146694	0.000000	44.311378	...	69	9	0	0	1	1	0
4	149d57cf92fc41cf94415803a877cb4b	3.646011	0.000000	2.721811	2.650065	2.122969	0.116900	0.100015	40.606701	...	71	9	1	0	0	1	0

5 rows × 63 columns

7. Finally, let's create a quick summary for the client

Before we finish up, the client wants a quick update on the project progress. Your AD wants you to draft an abstract (executive summary) of your findings so far.

Here is your task:

Develop an abstract slide synthesizing all the findings from the project so far, keeping in mind that it will be presented at the key stakeholders meeting, which the Head of the SME division and various other stakeholders will attend.

Note: a steering committee meeting is a meeting where the BCG team presents key findings and recommendations (and/or project progress) to key client stakeholders.

A few things to think about for this abstract include:

What is the most important number or metric to share with the client?
What impact would the model have on the client’s bottom line?

Please note, there are multiple ways to approach the task and that the sample answer is just one way to do it.

If you are stuck:

What do you think the client wants to hear? How much detail should you go into, especially with the technical details of your work?
Always test what you write with the “so what?” test, i.e. sharing a fact, even an interesting one, only matters if the client can actually do something useful with it.
E.g. “60% of your customers are from City A” is pointless, but “customers in City A should be prioritized for discounts as they are among your most valuable ones”, if true, is an actionable finding.

In [ ]: