Commonwealth Bank Internship¶
In today's post we'll be going through one of the internships provided by one of the leading banks in Australia: Commonwealth Bank. The internship is related to Data Science and is meant to be a small but interesting project for one's portfolio. It covers topics such as data wrangling (rearrangement and exploration of data), data anonymisation, as well as working with unstructured text data. Hopefully it will be interesting to read along!
The internship is split into a few segments:
- Task One : Data Aggregation and Analysis
- Task Two : Data Anonymisation
- Task Three : Propose Data Analysis Approaches
- Task Four : Designing a Database
In this post, we'll cover only the first two parts; the rest you can try for yourself @Introduction to Data Science
Data Aggregation and Analysis¶
Let's look at the first topic which covers some data insight analysis.
Problem Background ¶
Create a data engineering pipeline to aggregate and extract insights from a data set.
What you'll learn:
- Understand the importance of data engineering and analysis in a long-term data-focused vision.
- Gain experience in using real-world transactional data for analysis.
- Learn to perform data analysis and answer specific questions based on the dataset provided.
What you'll do:
- Analyze the "CSV Data Set" of supermarket transactions to address key questions.
- Document the analysis process, including formulas used, using spreadsheet software. Submit the analyzed file for review.
Problem Statement¶
Let's take a look at what is asked of us:
Hello,
I have provided you with a data set named “supermarket_transactions.csv.” This data set contains three years of transactional data we collected from supermarkets across Australia.
InsightSpark wants to access our data to perform some analysis across Australian supermarkets. Before this can happen, I need you to perform some analysis on the provided CSV data set (a sample of the database) to answer the following questions:
- Across locations, how many apples were purchased in cash?
- How much total cash was spent on these apples?
- Across all payment methods, how much money was spent at the Bakershire store location by non-member customers?
Dataset Preview¶
Quite a simple task at hand; we are given a set of transactional data (downscaled, of course) which we first need to take a quick look at in order to understand what we actually have. The first two questions concern a single product, so we'll load the data and grab a sample for that product.
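Loading the data isn't shown in the original output; a minimal sketch, assuming the CSV sits alongside the other internship files on Kaggle (the exact path is an assumption):
import pandas as pd

# Assumed path; adjust to wherever supermarket_transactions.csv is stored
transactions = pd.read_csv('/kaggle/input/commonwealth-bank-internship-data/supermarket_transactions.csv')
With the transactions loaded, we can filter for apple purchases: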
apple_transactions = transactions[transactions['product_name'] == 'apple']
apple_transactions.head()
+-------+-------+--------------------------------------+---------------------+----------+--------------------------------------+--------------+------------+--------------+-----------------+----------------+--------------------------------------+---------------+
| | index | id | timestamp | quantity | product_id | product_name | unit_price | total_amount | store | payment_method | customer_id | customer_type |
+-------+-------+--------------------------------------+---------------------+----------+--------------------------------------+--------------+------------+--------------+-----------------+----------------+--------------------------------------+---------------+
| 23742 | 23742 | 619db75d-bc5d-40b9-acf8-93476d29a36b | 2021-12-15 17:56:00 | 10 | 34bc5a93-7c74-464b-94bb-f9b31f0edca5 | apple | 4.59 | 45.9 | South Billyview | cash | 9de2834c-9a6b-4682-b314-8ec31063ee06 | member |
| 23743 | 23743 | bbc7a9e8-c272-431d-9703-9c3724eab634 | 2021-04-04 16:51:00 | 8 | 34bc5a93-7c74-464b-94bb-f9b31f0edca5 | apple | 4.59 | 36.72 | West Stefanie | credit card | 7fcc7661-c7ef-4ac2-b7de-e6479930a414 | non-member |
| 23744 | 23744 | 2f9b0c56-b281-4b42-b103-95f2bbd51586 | 2021-03-29 12:07:00 | 9 | 34bc5a93-7c74-464b-94bb-f9b31f0edca5 | apple | 4.59 | 41.31 | Andreburgh | contactless | 0e86f2a0-75db-4480-bf9b-c7726f76b6a0 | member |
| 23745 | 23745 | e2fe4f8e-421c-4c72-9310-b5a17cc2fcd4 | 2019-10-10 17:39:00 | 9 | 34bc5a93-7c74-464b-94bb-f9b31f0edca5 | apple | 4.59 | 41.31 | Anthonyton | debit card | 71d16b34-726d-4097-b65e-78d0defcbf4b | premium |
| 23746 | 23746 | 4e8c369b-fe2d-42fa-8dcb-2484c4529ded | 2019-10-09 17:35:00 | 4 | 34bc5a93-7c74-464b-94bb-f9b31f0edca5 | apple | 4.59 | 18.36 | West Stefanie | cash | 247e3a5a-8a7a-4c81-80a6-a58145e70621 | gold |
+-------+-------+--------------------------------------+---------------------+----------+--------------------------------------+--------------+------------+--------------+-----------------+----------------+--------------------------------------+---------------+
Data Column Explanations¶
Let's try to summarise what the data contains:
- id : Unique transaction identifier
- timestamp : Time of the transaction
- quantity : Number of items of the product purchased
- product_id : Unique product identifier
- product_name : Name of the product
- unit_price : Price for each item
- total_amount : The total amount for the purchase
- store : Location where the item was purchased
- payment_method : The method of payment
- customer_id : Unique customer identification number
- customer_type : Customer type
Transaction / Purchase Format¶
Transactions can be stored in a number of ways; we can see that our dataset has a rather simple structure: each customer (customer_id) makes a single product purchase (product_id), which is represented by its unique transaction identifier (id). Nothing too fancy.
Across locations, how many apples were purchased in cash?¶
The first request is to extract the number of apples purchased at each location, broken down by payment method. We can do this by using the pivot_table method. Since we are interested in the quantity purchased, we need to make sure we select quantity for values and set aggfunc to 'sum'.
pd.pivot_table(apple_transactions,index='store',columns='payment_method',values='quantity',aggfunc='sum',fill_value=0,margins=True).sort_values(by='cash',ascending=False)
+---------------------+------+-------------+-------------+------------+-----+
| store | cash | contactless | credit card | debit card | All |
+---------------------+------+-------------+-------------+------------+-----+
| All | 117 | 104 | 138 | 88 | 447 |
| South Cynthia | 18 | 0 | 0 | 0 | 18 |
| Swansonfurt | 17 | 0 | 10 | 0 | 27 |
| South Billyview | 12 | 0 | 0 | 0 | 12 |
| East Suzanneside | 10 | 0 | 0 | 0 | 10 |
| South Michaelfurt | 10 | 0 | 0 | 0 | 10 |
| South Edward | 8 | 2 | 0 | 0 | 10 |
| Lake Bryan | 8 | 9 | 0 | 0 | 17 |
| Tracyton | 7 | 0 | 0 | 0 | 7 |
| Erichaven | 7 | 0 | 10 | 0 | 17 |
| Alexmouth | 6 | 0 | 0 | 0 | 6 |
| West Stefanie | 4 | 0 | 13 | 0 | 17 |
| Jordanmouth | 4 | 8 | 0 | 0 | 12 |
| Julieview | 3 | 0 | 1 | 0 | 4 |
| Port Angela | 2 | 0 | 8 | 4 | 14 |
| Andersonland | 1 | 0 | 10 | 8 | 19 |
| Christopherfurt | 0 | 2 | 5 | 5 | 12 |
| South Christineside | 0 | 0 | 0 | 8 | 8 |
| Charlesbury | 0 | 10 | 0 | 6 | 16 |
| Aprilside | 0 | 7 | 0 | 0 | 7 |
| Anthonyton | 0 | 0 | 0 | 11 | 11 |
| East Sara | 0 | 0 | 11 | 5 | 16 |
| South Rachaelport | 0 | 0 | 1 | 3 | 4 |
| South Alyssa | 0 | 2 | 0 | 0 | 2 |
| Andreburgh | 0 | 9 | 0 | 2 | 11 |
| Vincentville | 0 | 0 | 9 | 0 | 9 |
| West John | 0 | 7 | 0 | 2 | 9 |
| Anthonymouth | 0 | 0 | 17 | 3 | 20 |
| North Joyfort | 0 | 8 | 0 | 0 | 8 |
| Port Emilymouth | 0 | 8 | 0 | 0 | 8 |
| East Jeremytown | 0 | 0 | 0 | 1 | 1 |
| North Charles | 0 | 0 | 2 | 0 | 2 |
| New Lisa | 0 | 0 | 0 | 7 | 7 |
| New Glenn | 0 | 0 | 10 | 0 | 10 |
| New Eric | 0 | 16 | 5 | 1 | 22 |
| East Ann | 0 | 6 | 0 | 0 | 6 |
| Justinstad | 0 | 0 | 0 | 4 | 4 |
| Jessicafort | 0 | 10 | 4 | 0 | 14 |
| Jaredside | 0 | 0 | 0 | 4 | 4 |
| Irwinport | 0 | 0 | 16 | 0 | 16 |
| East Candiceton | 0 | 0 | 0 | 7 | 7 |
| Michelemouth | 0 | 0 | 6 | 7 | 13 |
+---------------------+------+-------------+-------------+------------+-----+
How much total cash was spent on these apples?¶
The next question is pretty much the same, except we set values to total_amount, since we are interested in the total amount spent on this item across all transactions.
df = pd.pivot_table(apple_transactions,index='store',columns='payment_method',values='total_amount',aggfunc='sum',fill_value=0,margins=True).sort_values(by='cash',ascending=False).round(2)
+---------------------+--------+-------------+-------------+------------+---------+
| store | cash | contactless | credit card | debit card | All |
+---------------------+--------+-------------+-------------+------------+---------+
| All | 537.03 | 477.36 | 633.42 | 403.92 | 2051.73 |
| South Cynthia | 82.62 | 0.0 | 0.0 | 0.0 | 82.62 |
| Swansonfurt | 78.03 | 0.0 | 45.9 | 0.0 | 123.93 |
| South Billyview | 55.08 | 0.0 | 0.0 | 0.0 | 55.08 |
| South Michaelfurt | 45.9 | 0.0 | 0.0 | 0.0 | 45.9 |
| East Suzanneside | 45.9 | 0.0 | 0.0 | 0.0 | 45.9 |
| South Edward | 36.72 | 9.18 | 0.0 | 0.0 | 45.9 |
| Lake Bryan | 36.72 | 41.31 | 0.0 | 0.0 | 78.03 |
| Tracyton | 32.13 | 0.0 | 0.0 | 0.0 | 32.13 |
| Erichaven | 32.13 | 0.0 | 45.9 | 0.0 | 78.03 |
| Alexmouth | 27.54 | 0.0 | 0.0 | 0.0 | 27.54 |
| West Stefanie | 18.36 | 0.0 | 59.67 | 0.0 | 78.03 |
| Jordanmouth | 18.36 | 36.72 | 0.0 | 0.0 | 55.08 |
| Julieview | 13.77 | 0.0 | 4.59 | 0.0 | 18.36 |
| Port Angela | 9.18 | 0.0 | 36.72 | 18.36 | 64.26 |
| Andersonland | 4.59 | 0.0 | 45.9 | 36.72 | 87.21 |
| Christopherfurt | 0.0 | 9.18 | 22.95 | 22.95 | 55.08 |
| South Christineside | 0.0 | 0.0 | 0.0 | 36.72 | 36.72 |
| Charlesbury | 0.0 | 45.9 | 0.0 | 27.54 | 73.44 |
| Aprilside | 0.0 | 32.13 | 0.0 | 0.0 | 32.13 |
| Anthonyton | 0.0 | 0.0 | 0.0 | 50.49 | 50.49 |
| East Sara | 0.0 | 0.0 | 50.49 | 22.95 | 73.44 |
| South Rachaelport | 0.0 | 0.0 | 4.59 | 13.77 | 18.36 |
| South Alyssa | 0.0 | 9.18 | 0.0 | 0.0 | 9.18 |
| Andreburgh | 0.0 | 41.31 | 0.0 | 9.18 | 50.49 |
| Vincentville | 0.0 | 0.0 | 41.31 | 0.0 | 41.31 |
| West John | 0.0 | 32.13 | 0.0 | 9.18 | 41.31 |
| Anthonymouth | 0.0 | 0.0 | 78.03 | 13.77 | 91.8 |
| North Joyfort | 0.0 | 36.72 | 0.0 | 0.0 | 36.72 |
| Port Emilymouth | 0.0 | 36.72 | 0.0 | 0.0 | 36.72 |
| East Jeremytown | 0.0 | 0.0 | 0.0 | 4.59 | 4.59 |
| North Charles | 0.0 | 0.0 | 9.18 | 0.0 | 9.18 |
| New Lisa | 0.0 | 0.0 | 0.0 | 32.13 | 32.13 |
| New Glenn | 0.0 | 0.0 | 45.9 | 0.0 | 45.9 |
| New Eric | 0.0 | 73.44 | 22.95 | 4.59 | 100.98 |
| East Ann | 0.0 | 27.54 | 0.0 | 0.0 | 27.54 |
| Justinstad | 0.0 | 0.0 | 0.0 | 18.36 | 18.36 |
| Jessicafort | 0.0 | 45.9 | 18.36 | 0.0 | 64.26 |
| Jaredside | 0.0 | 0.0 | 0.0 | 18.36 | 18.36 |
| Irwinport | 0.0 | 0.0 | 73.44 | 0.0 | 73.44 |
| East Candiceton | 0.0 | 0.0 | 0.0 | 32.13 | 32.13 |
| Michelemouth | 0.0 | 0.0 | 27.54 | 32.13 | 59.67 |
+---------------------+--------+-------------+-------------+------------+---------+
Across all payment methods, how much money was spent at the Bakershire store location by non-member customers?¶
Next up, we need to verify which label is used for non-member customers in the customer_type column, so we can select the appropriate subset of data and evaluate how much was spent at the particular store. We can see below that the specified name does match what is found in the dataset.
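The counts below can be produced with a simple value_counts call; a minimal sketch (the exact call isn't shown in the original notebook):
transactions['customer_type'].value_counts()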
customer_type
non-member 8650
corporate 8536
premium 8427
staff 8397
member 8395
gold 8378
Name: count, dtype: int64
Next, we select the subset of interest using two conditions that need to be met simultaneously; this gives us 179 transactions for which we will find the transaction sum.
bakershire_nonmembers = transactions[(transactions['store'] == 'Bakershire') & (transactions['customer_type'] == 'non-member')]
bakershire_nonmembers.shape
(179, 12)
And we calculate the total amount spent across these transactions.
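The sum itself isn't shown in the original output; a minimal sketch (the example answer below reports this figure as $2,857.51):
# Total spent at Bakershire by non-member customers, across all payment methods
bakershire_nonmembers['total_amount'].sum()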
So that wasn't too difficult! The answers are provided below:
Success
Example Answer Explanation
The example answer is just one way to approach this problem, not the only solution. For this question, first, we opened the CSV file supermarket_transactions.csv
in Excel. Then we aggregated and analysed the data using filters and formulas to answer the following questions:
Question 1: Across locations, how many apples were purchased in cash?
To answer this question, we filtered the data sheet to include only rows where the
product_name
is “apple” and where the payment_method
is “cash.” Then, we summed the
quantity
column to get an answer of 117 apples.
Question 2: How much total cash was spent on these apples?
Here, we left the data sheet filtered to include only rows where the product_name
is “apple” and where the payment_method
is “cash.” Then, we summed the total_amount
column to get an answer of $537.03.
Question 3: Across all payment methods, how much money was spent at the Bakershire store location by non-member customers?
For this question, we cleared all the previous filter criteria. Then, we filtered the data sheet to include only rows where the customer_type
was non-member, and the store
was Bakershire. Then, we summed the total_amount
column to get an answer of $2,857.51.
Data Anonymisation¶
Now let's look at the second topic, which involves data anonymisation.
Problem Background ¶
What you'll learn:
- Understand the importance of data anonymization in protecting sensitive customer information.
- Learn various techniques for anonymizing data while preserving its utility.
- Comprehend the risks associated with linkage attacks on anonymized data.
What you'll do:
- Anonymize the provided data set, ensuring the protection of personal details.
- Utilize anonymization techniques such as removing unnecessary columns, masking identifying information, and categorizing sensitive figures.
- Submit the anonymized data as a CSV file.
Problem Statement¶
I have provided you with a data set named “mobile_customers.csv.” This contains information about customers that have signed up for the mobile app in the last three years.
We need you to anonymise this data to hide personal details while preserving any useful information for the data scientists at InsightSpark.
Here are some examples of how you may anonymise a data set: You could remove columns that don’t provide helpful information for analysis (e.g., names or credit card numbers). You could mask any columns that can identify an individual (e.g., passport numbers or mobile numbers). You could categorise personal figures (e.g., age and income) into a bracket rather than a specific number.
First, research the different techniques available for anonymising a data set. Then, edit the data set to create an anonymised data set as a CSV file. When finished, please submit this CSV file for me to review before we share it with InsightSpark.
Data Anonymisation approaches¶
We don't want customer data to leak out; nevertheless, there are various applications of customer data, from analyses to modelling. Generally speaking, companies classify data by privacy sensitivity, and each subsystem in a company operates with data of a corresponding level, so customer data doesn't tend to flow freely through all employees in a company. The process of data protection is probably handled by a specific department in charge of it, or something along those lines.
The data transfer target is the data scientists at InsightSpark, so we should put some logic into our transformation process: the data science team will probably want to extract useful relations from the data without really needing to know the exact details about the customers. This means that purely random data replacement would be rather pointless, even though it is one way we could anonymise the data.
Other ways can be:
- Binning (grouping into segments)
- Masking (hiding only parts of the data)
- Replacement (replace the unique values in data)
- Perturbation (for numerical values)
Data Column Explanations¶
Let's read the dataset to get a first glimpse of its contents:
customers = pd.read_csv('/kaggle/input/commonwealth-bank-internship-data/mobile_customers.csv')
customers.head()
+---+--------------------------------------+-----------------+-----------------+-----------------+--------+---------------------------+----------------------------+------------+-------------------------------+----------------------------+-----------------------------+--------------------------+-----+--------+----------------------+---------------------+---------------------------+--------------------+
| | customer_id | date_registered | username | name | gender | address | email | birthdate | current_location | residence | employer | job | age | salary | credit_card_provider | credit_card_number | credit_card_security_code | credit_card_expire |
+---+--------------------------------------+-----------------+-----------------+-----------------+--------+---------------------------+----------------------------+------------+-------------------------------+----------------------------+-----------------------------+--------------------------+-----+--------+----------------------+---------------------+---------------------------+--------------------+
| 0 | 24c9d2d0-d0d3-4a90-9a3a-e00e4aac99bd | 2021-09-29 | robertsbryan | Jonathan Snyder | M | 24675 Susan Valley | marcus58@hotmail.com | 1978-03-11 | ['78.937112', '71.260464'] | 195 Brandi Junctions | Byrd, Welch and Holt | Chief Technology Officer | 49 | 53979 | VISA 19 digit | 38985874269846 | 994 | 10/27 |
| | | | | | | North Dianabury, MO 02475 | | | | New Julieberg, NE 63410 | | | | | | | | |
| 1 | 7b2bc220-0296-4914-ba46-d6cc6a55a62a | 2019-08-17 | egarcia | Susan Dominguez | F | 4212 Cheryl Inlet | alexanderkathy@hotmail.com | 1970-11-29 | ['-24.1692185', '100.746122'] | 58272 Brown Isle Apt. 698 | Hurst PLC | Data scientist | 43 | 81510 | Discover | 6525743622515979 | 163 | 07/30 |
| | | | | | | Port Davidmouth, NC 54884 | | | | Port Michael, HI 04693 | | | | | | | | |
| 2 | 06febdf9-07fb-4a1b-87d7-a5f97d9a5faf | 2019-11-01 | turnermegan | Corey Hebert | M | 07388 Coleman Prairie | vwood@gmail.com | 2009-04-23 | ['8.019908', '-19.603269'] | 36848 Jones Lane Suite 282 | Mora, Caldwell and Guerrero | Chief Operating Officer | 47 | 205345 | VISA 16 digit | 4010729915028682247 | 634 | 04/26 |
| | | | | | | Lake Amy, IA 78695 | | | | Marquezbury, ID 26822 | | | | | | | | |
| 3 | 23df88e5-5dd3-46af-ac0d-0c6bd92e4b96 | 2021-12-31 | richardcampbell | Latasha Griffin | F | PSC 6217, Box 2610 | kathleen36@gmail.com | 1992-07-27 | ['62.497506', '2.717198'] | 317 Lamb Cape Apt. 884 | Patel PLC | Counselling psychologist | 34 | 116095 | VISA 16 digit | 4854862659569207844 | 7957 | 10/31 |
| | | | | | | APO AA 53585 | | | | Lake Amy, DC 79074 | | | | | | | | |
| 4 | 6069c2d7-7905-4993-a155-64f6aba143b1 | 2020-08-09 | timothyjackson | Colleen Wheeler | F | 0325 Potter Roads | johnbest@hotmail.com | 1989-09-16 | ['73.7924695', '-80.314720'] | 21936 Mary Islands | Smith-Mejia | Mining engineer | 57 | 107529 | JCB 16 digit | 213152724828217 | 72 | 05/28 |
| | | | | | | Lake Lisashire, NM 77502 | | | | Mendozafort, TN 37124 | | | | | | | | |
+---+--------------------------------------+-----------------+-----------------+-----------------+--------+---------------------------+----------------------------+------------+-------------------------------+----------------------------+-----------------------------+--------------------------+-----+--------+----------------------+---------------------+---------------------------+--------------------+
Let's describe the columns that we have, together with the anonymisation action we plan for each:
- customer_id : Identification assigned to the customer
- date_registered : The date of registration (perturbation)
- username : The username the customer uses
- name : The customer's name
- gender : Gender of the customer
- address : Customer's address
- email : Customer's email (mask)
- birthdate : Date of birth of the customer (perturbation)
- current_location : Location of the customer (remove)
- residence : Residence of the customer
- employer : Employer of the customer (replace)
- job : Job of the customer
- age : Age of the customer (bin)
- salary : Customer's salary (bin)
- credit_card_provider : Customer's credit card provider
- credit_card_number : The customer's credit card number
- credit_card_security_code : Customer's credit card security code
- credit_card_expire : Expiration date of the customer's credit card
Let's start with something simple: binning. Both age and salary can definitely be categorised, not providing the exact number but still keeping enough information that some insights about the customers can be made.
bins = [0, 18, 30, 45, 60, 100]
labels = ['0-18', '19-30', '31-45', '46-60', '60+']
customers['age'] = pd.cut(customers['age'], bins=bins, labels=labels)
customers['age'].value_counts()
bins = [0,79000,119000,169000,200000, 300000]
labels = ['0-79K', '80-119K', '120-169K', '170-200K', '200K+']
customers['salary'] = pd.cut(customers['salary'], bins=bins, labels=labels)
customers['salary'].value_counts()
We will also remove the current_location column, since it doesn't really hold any value beyond perhaps tracking, and it doesn't offer anything that residence doesn't already provide.
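A minimal sketch of the removal, assuming we modify the customers DataFrame in place:
# Drop the column entirely; redaction of low-value data is a valid privacy technique
customers = customers.drop(columns=['current_location'])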
Employer can be a useful feature; we can see that many customers share the same employers, yet there is no real need to reveal the exact employer, so we can simply replace the unique employer names with randomly generated text.
employer
Smith and Sons 18
Smith Inc 15
Smith PLC 11
Johnson Group 11
Johnson Ltd 11
..
Hicks, Ramirez and Taylor 1
Moreno, Mccarthy and Donovan 1
Klein-Foley 1
Watson-Watson 1
Warner, Munoz and Franklin 1
Name: count, Length: 8645, dtype: int64
So what we'll do is extract the unique values, create a mapping dictionary for them, generate random digits and text, concatenate them, and use the map method to replace the column values.
import random
import string
unique_employer = list(customers['employer'].unique())
unique_employer_mapper = {i:None for i in unique_employer}
def generate_random_name():
    digits = "".join([random.choice(string.digits) for i in range(20)])
    chars = "".join([random.choice(string.ascii_letters) for i in range(20)])
    return digits + chars

for employer in unique_employer:
    unique_employer_mapper[employer] = generate_random_name()
customers['employer'].map(unique_employer_mapper).value_counts()
employer
58604887536751648622FbvCGfRGBtnDGLWldyRF 18
18629626800688749605NmGJKOfhUhLunfJBxgOT 15
93505473119626534727ZBuGpghisKSeKttAWFBj 11
67338527509968294563wdJIoNdzKxpDAlhQYWqS 11
32329694609232288444HgflONofgmdPExwgiYPu 11
..
42630313409469792959ODugkraqZKMZhuvUzsHK 1
94222944269364035894tWTTBJxWfHBIdGfAOlED 1
55826701092383385383OpGYMhvZiWMmgZssUYmg 1
37074720113318535258XlwfPhjwwIDLFlNlkgJI 1
17949949387540722650PQkBoWfJTiCNVOSWzpAI 1
Name: count, Length: 8645, dtype: int64
Email addresses can also contain useful information. Whilst the username does contain information specific to the customer, the provider can reveal some patterns between users of the same provider, so what is often done is to place a mask over everything between the first and last characters of the username part of the email address; that's what we'll do as well! We'll use the regular expression library, which should be fun!
The pattern that we'll utilise is as follows:
- The first character of the username (^[a-zA-Z0-9]{1})
- The middle part of the username that we want to mask (.*)
- The last character of the username followed by the domain ([a-zA-Z0-9]{1}@.*$)
import re
def mask_email_username(email):
    # Regular expression to match the username part of the email
    pattern = r'(^[a-zA-Z0-9]{1})(.*)([a-zA-Z0-9]{1}@.*$)'
    # Substitute the middle part with '*' characters
    masked_email = re.sub(pattern, lambda m: f"{m.group(1)}{'*' * (len(m.group(2)))}{m.group(3)}", email)
    return masked_email
And with pandas apply and a lambda function, we'll apply the function to all rows.
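The apply call itself isn't shown above; a minimal sketch that would reproduce the output below:
# Apply the masking function to every email address
customers['email'].apply(lambda e: mask_email_username(e))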
0 m******8@hotmail.com
1 a************y@hotmail.com
2 v***d@gmail.com
3 k********6@gmail.com
4 j******t@hotmail.com
...
9995 j*****n@yahoo.com
9996 r************z@gmail.com
9997 b*********a@gmail.com
9998 w***d@gmail.com
9999 n*****s@yahoo.com
Name: email, Length: 10000, dtype: object
Another method we can also attempt to utilise is perturbation; we can apply it to numerical values or, as we'll do below, to dates. date_registered seems quite harmless compared to other information about the users, but we might not want to give away the exact details. Slight perturbation (the addition of noise) should still allow the data science team to analyse whether there are similarities between users that joined during a similar period (month), while keeping the exact dates hidden; so let's add or subtract a couple of days from each date.
import numpy as np
def random_days_adjustment(date):
    # Generate a random integer between 1 and 7
    days = np.random.randint(1, 8)
    # Randomly choose to add or subtract days
    adjustment = np.random.choice([-1, 1]) * days
    return date + pd.Timedelta(days=adjustment)
customers['date_registered'] = pd.to_datetime(customers['date_registered'])
customers['date_registered'].apply(random_days_adjustment)
0 2021-09-25
1 2019-08-15
2 2019-10-30
3 2021-12-30
4 2020-08-08
...
9995 2021-09-09
9996 2020-08-09
9997 2019-10-29
9998 2020-08-18
9999 2020-03-30
Name: date_registered, Length: 10000, dtype: datetime64[ns]
Whilst the above approach may be somewhat misleading to the recipient, for the birthdate column it is probably better to just retain the year and month of birth.
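A possible sketch of that idea (this step isn't shown in the original notebook; the birthdate column name comes from the dataset preview):
# Keep only the year and month of birth, dropping the exact day
customers['birthdate'] = pd.to_datetime(customers['birthdate']).dt.to_period('M').astype(str)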
These are only a few ways; in fact, there is no single right or wrong answer. Without any formalisation or implemented standard that we must follow, we could say that we're done! :3 Of course, there is still some more sensitive data left, but that's for another time!
The answers are provided below:
Success
The example answer is one way to approach this problem and not the only solution. To create the model answer, we built the following steps into a data privacy pipeline:
- Remove the customer_id and current_location column. Removal of data (redaction) that does not have much informational value is a valid data privacy technique.
- Mask the username column to hide the real username.
- Replace the original name column with a fake name. Replacing real values for fake values is a valid data privacy technique.
- Mask the email column to hide the real email address.
- Add noise to the date_registered and birthdate columns to hide the real value. Adding noise protects the real value by adding random noise to the actual value.
- Categorise the salary and age columns into bins. This categorisation hides the original values and preserves the distribution.
- The credit_card_provider and credit_card_expire have been tokenised. This step converts the categorical value of the columns into a different random value while preserving the distribution.
- The credit_card_number and credit_card_security_code have been masked.
- The employer and job columns have also been tokenised to preserve the original distribution.
- The residence and address have been replaced with fake values.
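As an illustration of the masking step mentioned above, a hedged sketch (not part of the original notebook) could keep only the last four digits of the card number:
# Hypothetical masking of credit_card_number: keep only the last four digits
customers['credit_card_number'] = (
    customers['credit_card_number']
    .astype(str)
    .apply(lambda n: '*' * (len(n) - 4) + n[-4:])
)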
Concluding Remarks¶
In this post, we covered a couple of things. First, we conducted a simple data analysis of user transactions. In principle, the questions are quite straightforward, and digging deeper into the dataset could be quite interesting; it would also have been more interesting to receive more realistic transaction data. Nevertheless, pivot tables are quite useful for extracting information from a dataset, so it's good to know the arguments that the method offers (for example, adding totals using margins).
Secondly, we looked at another interesting part: data confidentiality. This is actually quite an important topic, since data needs to be carefully managed, especially when it is passed from its storage source to another party. In our case, we were told that the data is to be provided to the data science team for analysis, so bearing that in mind we need to understand what is relevant to pass on and what is of no use.
The internship doesn't end here; there is also another section about unstructured data extraction and analysis using the Twitter API, which is a fun NLP problem!
The internship problem was quite interesting and lightweight; definitely give it a go! @Introduction to Data Science
Important Takeaways¶
Let's recap some things which are worth remembering, as they can help you with your own projects!
Pandas Column Binning¶
We utilised the pandas method pd.cut to split a column's data into segments. This is an important method that is often used in data analysis and machine learning pipelines.
bins = [0,79000,119000,169000,200000, 300000]
labels = ['0-79K', '80-119K', '120-169K', '170-200K', '200K+']
customers['salary'] = pd.cut(customers['salary'], bins=bins, labels=labels)
What the above code does is simply split the data into segments: values between 0 and 79,000 are assigned the label '0-79K', and so on.
pd.cut(
data, # pandas dataframe column
bins, # numerical value split location
labels # labels for each segment
)
Pandas Column Value Replacement¶
We often need to replace specific values or categories in a pandas column. In the example above, it was for customer data anonymisation; to replace values we can use the pd.Series.map method.
Next, we need a dictionary whose keys are the existing column values and whose values are their replacements. When we then call map with this dictionary on the column, each occurrence of 'column value' is replaced by 'change it to this' (see the sketch below).
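A minimal sketch, using hypothetical names:
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({'column': ['column value', 'other value']})

# Keys are the existing values, values are their replacements
mapping = {'column value': 'change it to this'}

# Values not present in the mapping become NaN
df['column'] = df['column'].map(mapping)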
Pandas Pivot Table¶
Pivoting and rearranging data is also very important in data analysis; we can understand our data better if we know how to use pd.pivot_table, as we did in the example below:
pd.pivot_table(apple_transactions,index='store',columns='payment_method',values='quantity',aggfunc='sum',fill_value=0,margins=True).sort_values(by='cash',ascending=False)
pd.pivot_table(
data, # our dataframe
index, # the unique categories in a column will be set as the index
columns, # the unique categories in a column will be set as the columns
values, # we will count/sum/... the data in this column
aggfunc, # the aggregate function (sum,count,...)
fill_value, # useful to fill in missing data for combinations that dont exist
margins, # add a "total" column, so we know the row/column sums
)
Regular Expressions¶
Another approach we used here is regular expressions; they are very useful when dealing with string data. Let's take a look at the pattern we used in this post: (^[a-zA-Z0-9]{1})(.*)([a-zA-Z0-9]{1}@.*$)
We need to focus on three parts:
- 1st Capturing Group (^[a-zA-Z0-9]{1}) : the first letter
- 2nd Capturing Group (.*) : everything in between the first and last letters of the username
- 3rd Capturing Group ([a-zA-Z0-9]{1}@.*$) : The last letter, then @ and the rest after that
We do this by using three capture groups. [a-zA-Z0-9] means that the first character can be either numeric or alphabetical, and {1} ensures that we select only a single character. The second capture group (.*) captures everything after the first character, but because the third capture group must also match, it stops short: the third group requires one letter or digit followed by the @ sign, so the second group captures everything up to that last character of the username. After the @ sign we select the rest of the address using the dot (.) and star (*) notation.
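A quick sanity check of the pattern on one of the email addresses from the dataset (the result matches the masked output we saw earlier):
import re

pattern = r'(^[a-zA-Z0-9]{1})(.*)([a-zA-Z0-9]{1}@.*$)'
email = 'marcus58@hotmail.com'
masked = re.sub(pattern, lambda m: f"{m.group(1)}{'*' * len(m.group(2))}{m.group(3)}", email)
print(masked)  # m******8@hotmail.com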
A very helpful resource for playing around with regular expressions is regex101.