Financial consumer complaint analysis
pip install -U kaleido
Collecting kaleido Downloading kaleido-0.2.1-py2.py3-none-manylinux1_x86_64.whl (79.9 MB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 79.9/79.9 MB 13.6 MB/s eta 0:00:00 Installing collected packages: kaleido Successfully installed kaleido-0.2.1 WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv Note: you may need to restart the kernel to use updated packages.
¶
In this section we will outline what customer feedback is, why it is an important part of any business, not only for financial companies, but in general. Show some examples, which should show that it can take some time to manually read and analyse what each consumer complaint is about. How can we utilise consumer feedback to streamline the process of consumer-company interaction (the need)
CONSUMER FEEDBACK
¶Let's point our some key points about consumer feedback:
Consumer feedback is an important part of day to day financial business operations Companies offering products must be able to know what their consumers think of their products Eg. positive & negative feedback & can be obtained from a number of sources (eg. twitter) In this case, we obtain data from a database, which registers consumers feedback of financial products Customers have specific issue on a number of topics they want want the company to address The form of consumer communication with the financial institution is via the web (as will be shown later)
USEFULNESS OF CONSUMER FEEDBACK
Can your customers tell you something important? | Source
If you run your own business, I know you do your best to please your customers
Satisfy their needs, and keep them loyal to your brand.
But how can you be sure that your efforts bring desired results?
If you do not try to find out what your clients think about your service
You will never be able to give them the best customer experience.
Their opinions about their experience with your brand are helpful information
That you can use to adjust your business to fit their needs more accurately
- The source clearly outlines that consumer feedback is quite critical for any business
- Consumer feedback in our problem is related to a consumer having an
issue
with a particualr financialproduct
or alike
CONSUMER FEEDBACK EXAMPLES
Let's look at a couple of examples of a consumer's addresses to a company:
Product: Credit reporting | Issue: Incorrect information on credit report
After looking at my credit report I saw a collection account that does not belong to me. I am not allowed to dispute this information online on Experian or over the phone making it impossible for me. This false information is ruining my credit and knowing full well this people did not do their job and allow people to just post false accounts on my report. They need to delete this information immediately and do a proper investigation as this information is not mine. '
Product: Credit card | Issue: Credit line increase/decrease
"XXXX i receive an email from citibank regarding my XXXX credit card. It was an offer to request a credit increase and it clearly stated that there would be NO Credit bureau inquiry made. I clicked on the link in the email and entered the requested information. a couple of days later I received an alert from my credit bureau monitoring service that a hard inquiry was done. Upon looking at the report it showed Citibank credit cards making a hard credit inquiry which was completely opposite of what their email stated. I called citi and they confimred that the email stated there would be no creidt inquiry done however they said that the request was made on a different citibank credit card which is why the hard inquiry was made. I explained to the rep I clicked on the link they provided and if was for a different account of mine it was not my issue but theirs and they need to remove the inquiry. They told me to send a letter to their credit dispute department explaining it. I sent the letter after waiting more than a month I received a blunt statement stating the it was a valid credit request and they will not remove the inquiry from my credit bureau. Citi performed bait and switch by offering a no inquiry credit request and then doing a hard inquiry which has negatively affected my credit score. I asked to remove it and received a generic letter stating they would not with no number to contact the department that sent the letter when i called the main customer service number they said that department dosent talk to customers and there was nothing else they can do. This has negatively affected my credit score and will remain on my credit report for 2 years because citi 's False advertising. and then their lack of fixing their error "
After reading this long complaint:
- It should become apparent that manual evaluation of each consumer issue can can a while to process and is very inefficient
- For a timely & helpful consumer response, the relevant problem not only must be processed in a timely manner, but passed on to a specific expert that has experience dealing with the particular issue
¶
STUDY AIM¶
- In this notebook, we'll be utilising machine learning, to create a model(s) that will be able to classify the type of complaint (as we did above) (by both product & issue)
- Such a model can be useful for a company to quickly understand the type of complaint & appoint a financial expert that will be able to solve the problem
AUTOMATED TICKET CLASSIFICATION MODEL¶
Our approach will include separate models, that will be in charge of classifying data on different subsets of data
- M1 will be classifying a product based on the customer's input complaint (
text
) (eg. Credit Reporting) - M2 will be in charge of classifying the particular issue to which the complaint belongs to (text)
¶
In this section, we will dive in to the dataset, making some slight adjustments; loading the data, looking at missing data, looking at the features & make some slight adustments
CONSUMER COMPLAINT DATABASE¶
- Download full Dataset from the provided link if you want to have the up to date data
- Complaints that the Consumer Financial Protection Bureau (CFPB) sends to companies for response are published in the Consumer Complaint Database after the company responds
- Confirming a commercial relationship with the consumer, or after 15 days, whichever comes first
LOAD DATASET¶
We start off by loading the dataset (we are loading the dataset without any missing data in consumer complaint narrative
(which is the complaint)
%%time
import pandas as pd
import plotly.express as px
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv('/kaggle/input/complaintsfull/main.csv',low_memory=False)
df = df.drop(['Unnamed: 0'],axis=1)
CPU times: user 22.3 s, sys: 3.15 s, total: 25.5 s Wall time: 36.9 s
TARGET LABELS¶
A quick glimps into the dataset gives us the view of the features that will be of interest to us in this study
- Product (Type of financial product)
- Sub-product (A more detailed subset of product)
- Issue (What was the problem)
- Sub-issue (A more detailed subset of product)
MISSING DATA¶
Visualise missing data in the dataset, looks like we have quite a bit overall & some in target variables (Sub-Product
& Sub-Issue
)
import missingno as ms
ms.matrix(df)
<AxesSubplot:>
- We have quite a bit of missing data, we have already removed rows, which have missing data in our text target (consumer complaint narrative)
- And quite a heavy dataset, let's utilise only the relevant data for our problem (by the end of this section) which should reduce the number of rows in our data significatntly
FEATURE DESCRIPTION¶
Brief summary of what each feature represents in our dataset
print('Dataset Features')
df.columns
Dataset Features
Index(['Date received', 'Product', 'Sub-product', 'Issue', 'Sub-issue', 'Consumer complaint narrative', 'Company public response', 'Company', 'State', 'ZIP code', 'Tags', 'Consumer consent provided?', 'Submitted via', 'Date sent to company', 'Company response to consumer', 'Timely response?', 'Consumer disputed?', 'Complaint ID', 'Year', 'Month', 'Day', 'DoW'], dtype='object')
- Date received - When the complaint was addressed
- Product - Complaint Type
- Issue - Brief summary of the issue
- Consumer complaint narrative - What the customer wrote (documents)
- Company public response - How did the company respond
- State - State in which the complaint was made
- Submitted - Form of complaint
- Customer disputed? - Did the customer dispute the response
ADDING DATETIME FEATURES¶
Lets add time-series based features, normalise column and column subset names & remove some column subset data for our target variable
- We have two timeline features,
Date received
&Date sent to company
- Lets extract additional features which can be useful for EDA
def object_to_datetime_features(df,column):
df[column] = df[column].astype('datetime64[ns]')
df['Year'] = df[column].dt.year
df['Month'] = df[column].dt.month
df['Day'] = df[column].dt.day
df['DoW'] = df[column].dt.dayofweek
df['DoW'] = df['DoW'].replace({0:'Monday',1:'Tuesday',2:'Wednesday',
3:'Thursday',4:'Friday',5:'Saturday',6:'Sunday'})
return df
df = object_to_datetime_features(df,'Date received')
df.columns
Index(['Date received', 'Product', 'Sub-product', 'Issue', 'Sub-issue', 'Consumer complaint narrative', 'Company public response', 'Company', 'State', 'ZIP code', 'Tags', 'Consumer consent provided?', 'Submitted via', 'Date sent to company', 'Company response to consumer', 'Timely response?', 'Consumer disputed?', 'Complaint ID', 'Year', 'Month', 'Day', 'DoW'], dtype='object')
COLUMN NAME NORMALISATION¶
Lets convert all column names to a lower register
# lower the register of columns
def normalise_column_names(df):
normalised_features = [i.lower() for i in list(df.columns)]
df.columns = normalised_features
return df
df = normalise_column_names(df)
NORMALISATION OF SUBSET NAMES¶
Lets convert all subset feature names into a lower register as well
# show the names of each subset
def show_subset_names(df,column):
return df[column].value_counts().index
show_subset_names(df,'product')
Index(['Credit reporting, credit repair services, or other personal consumer reports', 'Debt collection', 'Mortgage', 'Credit card or prepaid card', 'Checking or savings account', 'Student loan', 'Credit reporting', 'Money transfer, virtual currency, or money service', 'Vehicle loan or lease', 'Credit card', 'Bank account or service', 'Payday loan, title loan, or personal loan', 'Consumer Loan', 'Payday loan', 'Money transfers', 'Prepaid card', 'Other financial service', 'Virtual currency'], dtype='object')
def normalise_subset_names(df,column):
subset_names = list(df[column].value_counts().index)
norm_subset_names = [i.lower() for i in subset_names]
dict_replace = dict(zip(subset_names,norm_subset_names))
df[column] = df[column].replace(dict_replace)
return df
df = normalise_subset_names(df,'product')
show_subset_names(df,'product')
Index(['credit reporting, credit repair services, or other personal consumer reports', 'debt collection', 'mortgage', 'credit card or prepaid card', 'checking or savings account', 'student loan', 'credit reporting', 'money transfer, virtual currency, or money service', 'vehicle loan or lease', 'credit card', 'bank account or service', 'payday loan, title loan, or personal loan', 'consumer loan', 'payday loan', 'money transfers', 'prepaid card', 'other financial service', 'virtual currency'], dtype='object')
FILTER SUBSET DATA¶
Let's keep only specific subsets of data in the product column
# keep only specific subset in a feature
def keep_subset(df,column,lst):
all_features = list(df[column].value_counts().index)
keep_features = lst
# subset data
subset_data = dict(tuple(df.groupby(column)))
subset_data_filter = lambda x, y: dict([ (i,x[i]) for i in x if i in set(y) ])
# dictionary with only selected keys
filtered_data = subset_data_filter(subset_data,lst)
filtered_df = pd.concat(filtered_data.values())
filtered_df.reset_index(drop=True,inplace=True)
return filtered_df
# remove specific subset from feature
def remove_subset(df,column,lst):
all_features = list(df[column].value_counts().index)
keep_features = lst
# subset data
subset_data = dict(tuple(df.groupby(column)))
set_all_features = set(all_features)
set_keep_features = set(lst)
# features of dictionary which should remain
remaining_features = set_all_features - set_keep_features
subset_data_filter = lambda x, y: dict([ (i,x[i]) for i in x if i in set(y) ])
filtered_data = subset_data_filter(subset_data,remaining_features)
filtered_df = pd.concat(filtered_data.values())
filtered_df.reset_index(drop=True,inplace=True)
return filtered_df
lst_keep = ['credit reporting', 'debt collection', 'mortgage', 'credit card',
'bank account or service', 'consumer Loan', 'student loan',
'payday loan', 'prepaid card', 'money transfers',
'other financial service', 'virtual currency']
df = keep_subset(df,'product',lst_keep)
# sdf = remove_subset(df,'product',lst_remove)
df['product'].value_counts()
debt collection 195373 mortgage 99141 student loan 33606 credit reporting 31587 credit card 18838 bank account or service 14885 payday loan 1746 money transfers 1497 prepaid card 1450 other financial service 292 virtual currency 16 Name: product, dtype: int64
SELECT SUBSET OF DATA¶
- Let's choose only a specific subset of time series data because to reduce the computational load
- We can note that data beyond 2017 has only three of the available subsets
df['year'].value_counts()
2016 73146 2017 59087 2015 51779 2021 50757 2022 42867 2018 41708 2020 40801 2019 38286 Name: year, dtype: int64
ldf = df.groupby('product').count()['day'].to_frame().sort_values(ascending=False,by='day')
ldf.style\
.bar(align='mid',
color=['#d65f5f','#F1A424'])
day | |
---|---|
product | |
debt collection | 195373 |
mortgage | 99141 |
student loan | 33606 |
credit reporting | 31587 |
credit card | 18838 |
bank account or service | 14885 |
payday loan | 1746 |
money transfers | 1497 |
prepaid card | 1450 |
other financial service | 292 |
virtual currency | 16 |
We still have 184 thousand samples to go by, which should still be sufficient for our task
df = df[df['year'].isin([2015,2016,2017])]
print(f'final shape: {df.shape}')
final shape: (184012, 22)
PRODUCT¶
Let's plot the grouped subset distribution for feature Product
fig = px.bar((df['product']
.value_counts(ascending=False)
.to_frame()),
x='product',
template='plotly_white',
title='Product Subset Distribution')
fig.update_traces(marker_line_color='#F1A424',
marker_line_width=0.1,
marker={'color':'#F1A424'},width=0.5)
fig.show("png")
- We can note that we have some target class inbalance, in both product and issue features,
- Lets stick stratification, we need to make sure each class is represented in both datasets
¶
The dataset contains a lot of categorical which can be grouped & analysed:
- Consumer Complaint Method (How was the complaint made?)
- Company to which the consumer complained (To which company was the complaint made)
- Consumer Complaint Timeline (When were the complaints made?)
- Consumer Complaint Timeline Group Trends (grouping data; are there any trends in the timeline)
- Consumer Address State (In which state was the complaint made?)
- Company Response to Consumer Complaints (What was the response to the complaint?)
- Consumer Response to company response (Did the consumer dispuse the response?)
CONSUMER COMPLAINT METHOD¶
- All customers addresses for which we have text data are submitted via the web
- This implies that all other forms of text data of complaints were not registered in the data for other forms of complaint submission
df['submitted via'].value_counts(ascending=True)
Web 184012 Name: submitted via, dtype: int64
FINANCIAL COMPANY¶
Let's visualise the top 10 companies for which we have consumer complaint feedback
# plot subset value counts
def plot_subset_counts(df,column,orient='h',top=None):
ldf = df[column].value_counts(ascending=False).to_frame()
ldf.columns = ['values']
if(top):
ldf = ldf[:top]
if(orient is 'h'):
fig = px.bar(data_frame=ldf,
x = ldf.index,
y = 'values',
template='plotly_white',
title='Subset Value-Counts')
elif('v'):
fig = px.bar(data_frame=ldf,
y = ldf.index,
x = 'values',
template='plotly_white',
title='Subset Value-Counts')
fig.update_layout(height=400)
fig.update_traces(marker_line_color='white',
marker_line_width=0.5,
marker={'color':'#F1A424'},
width=0.75)
fig.show("png")
plot_subset_counts(df,'company',orient='v',top=10)
COMPANY RESPONSE¶
Let's look at the data about what the company decided to do about the registered complaint
ldf = df['company public response'].value_counts(ascending=False).to_frame()
ldf.style\
.bar(align='mid',
color=['#3b3745','#F1A424'])
company public response | |
---|---|
Company has responded to the consumer and the CFPB and chooses not to provide a public response | 41598 |
Company chooses not to provide a public response | 18935 |
Company believes it acted appropriately as authorized by contract or law | 17907 |
Company believes the complaint is the result of a misunderstanding | 1839 |
Company disputes the facts presented in the complaint | 1792 |
Company believes complaint caused principally by actions of third party outside the control or direction of the company | 1436 |
Company believes complaint is the result of an isolated error | 1345 |
Company believes complaint represents an opportunity for improvement to better serve consumers | 813 |
Company can't verify or dispute the facts in the complaint | 768 |
Company believes complaint relates to a discontinued policy or procedure | 18 |
- Most of the consumer complaints were addressed in private as opposed to public
COMPLAINT TIMELINE¶
- Let's look at the weekly (using
resample
) complaint addresses (let's see if there are some trends in the time series data)
# Daily Complaints
complaints = df.copy()
complaints_daily = complaints.groupby(['date received']).agg("count")[["product"]] # daily addresses
# Sample weekly
complaints_weekly = complaints_daily.reset_index()
complaints_weekly = complaints_weekly.resample('W', on='date received').sum() # weekly addresses
fig = px.line(complaints_weekly,complaints_weekly.index,y="product",
template="plotly_white",title="Weekly Complaints",height=400)
fig.update_traces(line_color='#F1A424')
fig.show("png")
- We can observe an increasing trend in complaints registed (this could be because users simply regitered the complaints more)
- After April 2nd, 2017, there is a rapid decline in registered complaints
- Some interesting peaks with an unusually high number of registed complaints occured in 2017 (January,September)
- We can note that the pandemic had also an effect on the number of complaints registed
CONSUMER COMPLAINT TIMELINE TRENDS¶
Let's group all complaint data into groups for Day of the month (DoM), time of the year (ToY) & day of the week (DoW)
fig = px.bar(df['day'].value_counts(ascending=True).to_frame(),y='day',
template='plotly_white',height=300,
title='Day of the Month Complaint Trends')
fig.update_xaxes(tickvals = [i for i in range(0,32,1)])
fig.update_traces(marker_line_color='#F1A424',marker_line_width=0.1,
marker={'color':'#F1A424'},width=0.5)
fig.update_traces(textfont_size=12, textangle=0,
textposition="outside", cliponaxis=False)
fig.show("png")
fig = px.bar(df['month'].value_counts(ascending=False).to_frame(),y='month',
template='plotly_white',height=300,
title='Month of the Year Complaint Trends')
fig.update_xaxes(tickvals = [i for i in range(0,13,1)])
fig.update_traces(marker_line_color='#F1A424',marker_line_width=0.1,
marker={'color':'#F1A424'},width=0.5)
fig.show("png")
# By DoW
fig = px.bar(df['dow'].value_counts(ascending=False).to_frame(),y='dow',
template='plotly_white',height=300,
title='Day of the Week Complaint Trends')
fig.update_traces(marker_line_color='#F1A424',marker_line_width=0.1,
marker={'color':'#F1A424'},width=0.5)
fig.update_traces(textfont_size=12, textangle=0, textposition="outside", cliponaxis=False)
fig.show("png")
- Day of the month seems to be quite a cyclic trend, with lower numbers closer to the weekends
- July, August & September are associated with increased complaints, December, January & February associated with lower number of complaints
- Novermber, December, January & February don't have adequate data
- Tuesdays & Wednesdays tend to be the most common day a consumer will write a complaint
- Consumers don't tend to write complaints on weekends (Saturday,Sunday)
COMPLAINT ORIGIN (STATE) ¶
Let's investigate the distribution from which geographical state the complaint was made from
ldf = df['state'].value_counts(ascending=True).to_frame()[50:]
fig = px.bar(ldf,x='state',template='plotly_white',
title='State of Complaint',height=400)
fig.update_traces(marker_line_color='#F1A424',marker_line_width=1,
marker={'color':'#F1A424'},width=0.4)
fig.show("png")
- Most of the consumer complaints are from California, Florida, Texas, Georgia and New York
COMPARING RESPONSE TO DISPUTES¶
Lastly, we can combine the last two sections and see for each Product
, how many disputes there have been each month
disputed = df[['product','consumer disputed?','month']]
fig = px.histogram(disputed, y='product',
color='consumer disputed?',
template='plotly_white',
height = 700,
barmode='group',
color_discrete_sequence=['#F1A424','#3b3745'],
facet_col_wrap=3,
facet_col='month')
fig.update_layout(showlegend=False)
fig.update_layout(barmode="overlay")
fig.update_traces(opacity=0.5)
fig.show("png")
- Morgage & Debt Collection tend to be quite often disputed all year round
- September & October have the highest disputed cases for Morgage & Debt Collection
- July & August have an unusually high ammount of credit reporting complaints
¶
- We've done some exploratory data analysis & understand our problem target variables a little better, let's focus on preparing the data for machine learning
- In total, we have 1159430 complaints, although not evenly distributed as we saw in Section 3
- Our smallest class (virtual currency) has only 16 issues (which is very little data)
df['product'].value_counts(ascending=False).to_frame().sum()
product 184012 dtype: int64
df['product'].value_counts(ascending=False).to_frame().tail(3)
product | |
---|---|
prepaid card | 1450 |
other financial service | 292 |
virtual currency | 16 |
REVIEW CLASS DISBALANCE¶
There doesn't seem to be any error associated with labelling, so let's not remove this subset
Circle is a Boston-based financial services company that uses blockchain technology for its peer-to-peer payments and cryptocurrency-related products.
Despite wanting to utilise Stratification, it seems like we may not have enough data for the model to be able to classify complaints with little data available
print('Sample from virtual currency:')
vc = dict(tuple(df.groupby(by='product')))['virtual currency']
vc.iloc[[0]]['consumer complaint narrative'].values[0]
Sample from virtual currency:
'Signedup XXXX family members for referrals on Coinbase.com. Coinbase at that time offered {$75.00} for each person referred to their service. I referred all XXXX and they met the terms Coinbase intially offered. Signup took a while do to money transfer timeframes setup by Coinbase. In that time, Coinbase changed their promotion and terms to {$10.00} for referrals. When asked why, they said they could change terms at anytime ( even if signup up for {$75.00} referral bonus ) and that family members did not meet the terms either. Felt like they just change terms to disclude giving out referral bonuses.'
PRODUCT SUBSET AMBIGUITY¶
- We have a feature
Credit reporting, credit repair services, or other personal consumer reports
, it seems like this feature is not quite sorted - We already have a separate subgroup for Credit reporting, but not for cedit repair services or other consumer reports
- There is the possibility that this subgroup will contain complaints of other subgroups, which would affect the model accuracy
- Lets remove the subset: Credit reporting, credit repair services, or other personal consumer reports for the time being
Some of the approaches we could take are:
- Try to sort the data by keywords
# When we have a feature that contains subsets, we can remove unwanted subsets
def remove_subset(df,feature,lst_groups):
ndf = df.copy()
# Let's down sample all classes with frequencies above 4k
group = dict(tuple(ndf.groupby(by=feature)))
subset_group = list(group.keys()) # subsets in feature
# Check if features exist in columns
if(set(lst_groups).issubset(subset_group)):
# remove unwanted subset
for k in lst_groups:
group.pop(k, None)
df = pd.concat(list(group.values()))
df.reset_index(inplace=True,drop=True)
return df
- We are down to 611,233 complaints (about have of what we had)
- Let's also confirm that our function remove_subset works correctly
LIMIT TARGET CLASS SUBSET SAMPLES¶
We saw that we have quite a bit of data available
- Most of which are in particular classes (eg. credit collection ,debt reporting & mortgage)
- Let's limit our data to 4000 text samples from each class using function
downsample_features
# Downsample selected
def downsample_subset(df,feature,lst_groups,samples=4000):
ndf = df.copy()
# Let's down sample all classes with frequencies above 4k
group = dict(tuple(ndf.groupby(by=feature)))
subset_group = list(group.keys())
# Check if features exist in columns
if(set(lst_groups).issubset(subset_group)):
dict_downsamples = {}
for feature in lst_groups:
dict_downsamples[feature] = group[feature].sample(samples)
# remove old data
for k in lst_groups:
group.pop(k, None)
# read them back
group.update(dict_downsamples)
df = pd.concat(list(group.values()))
df.reset_index(inplace=True,drop=True)
return df
else:
print('feature not found in dataframe')
# Select subset features which have more than 4000 samples
subset_list = list(df['product'].value_counts()[df['product'].value_counts().values > 4000].index)
df = downsample_subset(df,'product',subset_list,samples=4000)
- Let's check if our function
downsample_subset
works correctly
df['product'].value_counts()
debt collection 4000 mortgage 4000 credit reporting 4000 credit card 4000 student loan 4000 bank account or service 4000 payday loan 1746 money transfers 1497 prepaid card 1450 other financial service 292 virtual currency 16 Name: product, dtype: int64
TRAIN-TEST SUBSET SPLITTING¶
- Next, as per standard requirements, we need to be able to validate the model after training
- We will split the data into two groups, training & validation subsets with a validation size of 0.2
- Stratification will also be applied to both groups, in order to guarantee all classes in both subgroups
# Select only relevant data
df_data = df[['consumer complaint narrative','product']]
df_data.columns = ['text','label']
df_data.head()
text | label | |
---|---|---|
0 | I made a wire transfer through Citibank to XXX... | money transfers |
1 | I purchased a money order on XX/XX/2016 ( to c... | money transfers |
2 | I have complained of false online transfer num... | money transfers |
3 | I paid by bank wire transfer on XXXX/XXXX/XXXX... | money transfers |
4 | I found a XXXX Bulldog for sale on XXXX after ... | money transfers |
from sklearn.model_selection import train_test_split as tts
train_files,test_files, train_labels, test_labels = tts(df_data['text'],
df_data['label'],
test_size=0.1,
random_state=32,
stratify=df_data['label'])
train_files = pd.DataFrame(train_files)
test_files = pd.DataFrame(test_files)
train_files['label'] = train_labels
test_files['label'] = test_labels
print(type(train_files))
print('Training Data',train_files.shape)
print('Validation Data',test_files.shape)
<class 'pandas.core.frame.DataFrame'> Training Data (26100, 2) Validation Data (2901, 2)
import plotly.express as px
train_values = train_files['label'].value_counts()
test_values = test_files['label'].value_counts()
visual = pd.concat([train_values,test_values],axis=1)
visual = visual.T
visual.index = ['train','test']
fig = px.bar(visual,template='plotly_white',
barmode='group',text_auto=True,height=300,
title='Train/Test Split Distribution')
fig.show("png")
¶
In order to create a baseline model, let's extract the hidden state data from DistilBERT
and use it as features for our linear model
GENERATING DATASET¶
- We have a daframe containing text & label data for both training & validation datasets
- We'll use HF's more intuitive to use
Dataset
class (which allows us to convert between types very easily)
train_files
text | label | |
---|---|---|
8290 | I paid off all of my bills and should not have... | debt collection |
1520 | Several checks were issued from XXXX for possi... | other financial service |
23071 | In XXXX, we my husband and myself took out a l... | student loan |
4123 | I use an Amex Serve card ( a prepaid debit car... | prepaid card |
16470 | Despite YEARS of stellar credit reports and sc... | credit reporting |
... | ... | ... |
21733 | I know that I am victim of student loan scam. ... | student loan |
27578 | On XX/XX/XXXX I made a payment of {$380.00} to... | bank account or service |
26297 | XXXX XXXX XXXX XXXX XXXX, AZ XXXX : ( XXXX ) X... | bank account or service |
17867 | After nearly a decade of business with Bank of... | credit card |
144 | On XXXX XXXX, 2015, I made a purchase on EBay.... | money transfers |
26100 rows × 2 columns
import transformers
transformers.logging.set_verbosity_error()
import warnings; warnings.filterwarnings('ignore')
import os; os.environ['WANDB_DISABLED'] = 'true'
from datasets import Dataset,Features,Value,ClassLabel, DatasetDict
traindts = Dataset.from_pandas(train_files)
traindts = traindts.class_encode_column("label")
testdts = Dataset.from_pandas(test_files)
testdts = testdts.class_encode_column("label")
# Pandas indicies not reset ie. __index_level_0__ additional column is added, resetting index
corpus = DatasetDict({"train" : traindts ,
"validation" : testdts })
corpus['train']
Dataset({ features: ['text', 'label', '__index_level_0__'], num_rows: 26100 })
TOKENISATION¶
- Time to tokenise the text data
- We'll be using the
AutoTokenizer
class from a pretrained modeldistilbert-base-uncased
in order to generate subword tokens
from transformers import AutoTokenizer
# Load parameters of the tokeniser
model_ckpt = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
# Tokenisation function
def tokenise(batch):
return tokenizer(batch["text"],
padding=True,
truncation=True)
# apply to the entire dataset (train,test and validation dataset)
corpus_tokenised = corpus.map(tokenise,
batched=True,
batch_size=None)
print(corpus_tokenised["train"].column_names)
['text', 'label', '__index_level_0__', 'input_ids', 'attention_mask']
LOADING PRESET MODEL¶
HuggingFace allows us to load a variety of pretrained models:
- Let's utlilise the distilbert-base-uncased model
The model was trained to predict [mask] values
- Given an input sequence (the above link shows an example)
Let's use it to extract the last hidden state of each input sequence
- Use it to train a more tradition machine learning model (baseline M1 model)
from transformers import AutoModel
import torch
# load a pretrained transformer model
model_ckpt = "distilbert-base-uncased"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
# move model to device
model = AutoModel.from_pretrained(model_ckpt).to(device)
cuda huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... To disable this warning, you can either: - Avoid using `tokenizers` before the fork if possible - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... To disable this warning, you can either: - Avoid using `tokenizers` before the fork if possible - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... To disable this warning, you can either: - Avoid using `tokenizers` before the fork if possible - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
EXTRACT MODEL HIDDEN STATE¶
- For each tokenised input
text
, we can utilise the loaded model & extract the last hidden state that can be used as features for machine learning models - The same strategy was applied in notebook Twitter Emotion Classification
# Function used to store last_hidden_state data of distilbert-base-uncased
def extract_hidden_states(batch):
# Place model inputs on the GPU
inputs = {k:v.to(device) for k,v in batch.items()
if k in tokenizer.model_input_names}
# Extract last hidden states
with torch.no_grad():
last_hidden_state = model(**inputs).last_hidden_state
# Return vector for [CLS] token
return {"hidden_state": last_hidden_state[:,0].cpu().numpy()}
TO TENSORS¶
Before training using the pytorch
model, we need to set the corresponding format using set_format
# Change Data to Torch tensor
corpus_tokenised.set_format("torch",
columns=["input_ids", "attention_mask", "label"])
corpus_tokenised
DatasetDict({ train: Dataset({ features: ['text', 'label', '__index_level_0__', 'input_ids', 'attention_mask'], num_rows: 26100 }) validation: Dataset({ features: ['text', 'label', '__index_level_0__', 'input_ids', 'attention_mask'], num_rows: 2901 }) })
INFERENCE & EXTRACT HIDDEN STATE¶
Uisng map
we can apply the function extract_hidden_states
to both training & validation datasets
# Extract last hidden states (faster w/ GPU)
corpus_hidden = corpus_tokenised.map(extract_hidden_states,
batched=True,
batch_size=32)
corpus_hidden["train"].column_names
['text', 'label', '__index_level_0__', 'input_ids', 'attention_mask', 'hidden_state']
# Empty cache
torch.cuda.empty_cache()
The extracted hidden state corpus corpus_hidden
has been uploaded to dataset
# Save our data
corpus_hidden.set_format(type="pandas")
# Add label data to dataframe
def label_int2str(row):
return corpus["train"].features["label"].int2str(row)
ldf = corpus_hidden["train"][:]
ldf["label_name"] = ldf["label"].apply(label_int2str)
ldf.to_pickle('training.df')
ldf = corpus_hidden["validation"][:]
ldf["label_name"] = ldf["label"].apply(label_int2str)
ldf.to_pickle('validation.df')
!ls /kaggle/working/
# !ls /kaggle/input/hiddenstatedata/
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... To disable this warning, you can either: - Avoid using `tokenizers` before the fork if possible - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) __notebook__.ipynb training.df validation.df
DEFINING HIDDEN STATE DATASET¶
- Having extracted the last hidden state data from model distilbert-base-uncased, let's define the training data
- The process is quite long so we'll load the data saved in the last section from training.csv & validation.csv
# reload saved data
import pandas as pd
import pickle
# Load hidden state data
# training = pd.read_pickle('/kaggle/input/hiddenstatedata/training.df')
# validation = pd.read_pickle('/kaggle/input/hiddenstatedata/validation.df')
training = pd.read_pickle('training.df')
validation = pd.read_pickle('validation.df')
training.head()
labels = training[['label','label_name']]
label = []
for i in labels.label.unique():
label.append(labels[labels['label'] == i].iloc[[0]]['label_name'].values[0])
label
['debt collection', 'other financial service', 'student loan', 'prepaid card', 'credit reporting', 'mortgage', 'payday loan', 'credit card', 'bank account or service', 'money transfers', 'virtual currency']
# # Define our training & validation datasets
import numpy as np
X_train = np.stack(training['hidden_state'])
X_valid = np.stack(validation["hidden_state"])
y_train = np.array(training["label"])
y_valid = np.array(validation["label"])
print(f'Training Dataset: {X_train.shape}')
print(f'Validation Dataset {X_valid.shape}')
Training Dataset: (26100, 768) Validation Dataset (2901, 768)
TRAIN MODEL¶
Let's start with something quite simplistic, LogisticRegression
often work quite well
%%time
from sklearn.linear_model import LogisticRegression as LR
# We increase `max_iter` to guarantee convergence
lr_clf = LR(max_iter = 2000)
lr_clf.fit(X_train, y_train)
CPU times: user 8min 59s, sys: 39.2 s, total: 9min 38s Wall time: 5min 5s
LogisticRegression(max_iter=2000)
# Predictions
y_preds_train = lr_clf.predict(X_train)
y_preds_valid = lr_clf.predict(X_valid)
print('LogisticRegression:')
print(f'training accuracy: {round(lr_clf.score(X_train, y_train),3)}')
print(f'validation accuracy: {round(lr_clf.score(X_valid, y_valid),3)}')
LogisticRegression: training accuracy: 0.807 validation accuracy: 0.772
# save sklearn model
import joblib
filename = 'classifier.joblib.pkl'
_ = joblib.dump(lr_clf, filename, compress=9)
# load sklearn model
# lr_clf = joblib.load('/kaggle/input/hiddenstatedata/' + filename)
# lr_clf
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix
def plot_confusion_matrix(y_model, y_true, labels):
cm = confusion_matrix(y_true,y_model,normalize='true')
fig, ax = plt.subplots(figsize=(8,8))
disp = ConfusionMatrixDisplay(confusion_matrix=cm.round(2).copy(), display_labels=labels)
disp.plot(ax=ax, colorbar=False)
plt.title("Confusion matrix")
plt.xticks(rotation = 90) # Rotates X-Axis Ticks by 45-degrees
plt.tight_layout()
plt.show()
labels = list(training.label_name.value_counts().index)
# Validation Dataset Confusion Matrix
plot_confusion_matrix(y_preds_valid, y_valid, labels)
- Compressed distilbert embedding features work quite well in this this problem
- Looks like we have quite a good model to begin with, scoring a validation accuracy of 0.77
- We can note that the model payday loan & virtual currency are very pooly predicted subsets
- other financial services is predicted quite well (which is surprising because for the transformer model, we have the opposite)
¶
- With the fine-tune approach, we do not use the hidden states as fixed features
- Instead, we train them from a given model state
- This requires the classification head to be differentiable (neural network for classification)
LOAD PRETRAINED MODEL¶
- We'll load the same DistilBERT model using
model_ckpt
"distilbert-base-uncased" - This time however we will be loading
AutoModelForSequenceClassification
(we usedAutoModel
when we extracted embedding features) AutoModelForSequenceClassification
model has a classification head on top of the pretrained model outputs- We only need to specify the number of labels the model has to predict
num_labels
# Empty cache
torch.cuda.empty_cache()
# Change Data to Torch tensor
corpus_tokenised.set_format("torch",
columns=["input_ids", "attention_mask", "label"])
corpus_tokenised
DatasetDict({ train: Dataset({ features: ['text', 'label', '__index_level_0__', 'input_ids', 'attention_mask'], num_rows: 26100 }) validation: Dataset({ features: ['text', 'label', '__index_level_0__', 'input_ids', 'attention_mask'], num_rows: 2901 }) })
from transformers import AutoModelForSequenceClassification
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_ckpt = "distilbert-base-uncased"
model = (AutoModelForSequenceClassification
.from_pretrained(model_ckpt,
num_labels=len(labels))
.to(device))
EVALUATION METRICS¶
We'll monitor the F1 score
& accuracy
, the function is required to be passed in the Trainer
class
from sklearn.metrics import accuracy_score, f1_score
def compute_metrics(pred):
labels = pred.label_ids
preds = pred.predictions.argmax(-1)
f1 = f1_score(labels, preds, average="weighted")
acc = accuracy_score(labels, preds)
return {"accuracy": acc, "f1": f1}
DEFINE TRAINER¶
- Next we need to define the model training parameters, which can be done using
TrainingArguments
- Let's train the DistilBERT model for 3 iterations with a learning rate of 2e-5 and a batch size of 64
- The
Trainer
requires inputs of a model, model arguments, metrics, the datasets (train,validation) & the tokeniser
from transformers import Trainer, TrainingArguments
from transformers import Trainer
bs = 16 # batch size
model_name = f"{model_ckpt}-finetuned-financial"
labels = corpus_tokenised["train"].features["label"].names
# Training Arguments
training_args = TrainingArguments(output_dir=model_name,
num_train_epochs=3, # number of training epochs
learning_rate=2e-5, # model learning rate
per_device_train_batch_size=bs, # batch size
per_device_eval_batch_size=bs, # batch size
weight_decay=0.01,
evaluation_strategy="epoch",
disable_tqdm=False,
report_to="none",
push_to_hub=False,
log_level="error")
trainer = Trainer(model=model, # Model
args=training_args, # Training arguments (above)
compute_metrics=compute_metrics, # Computational Metrics
train_dataset=corpus_tokenised["train"], # Training Dataset
eval_dataset=corpus_tokenised["validation"], # Evaluation Dataset
tokenizer=tokenizer)
TRAIN MODEL¶
Let's finally fine-tune our transform model to fit our classification problem
%%time
# Train & save model
trainer.train()
trainer.save_model()
Epoch | Training Loss | Validation Loss | Accuracy | F1 |
---|---|---|---|---|
1 | 0.535000 | 0.528666 | 0.842468 | 0.839433 |
2 | 0.410300 | 0.483131 | 0.858669 | 0.855979 |
3 | 0.312500 | 0.477747 | 0.866942 | 0.864213 |
CPU times: user 36min 55s, sys: 16.5 s, total: 37min 12s Wall time: 37min 21s
# from transformers import pipeline
# load from previously saved model
# classifier = pipeline("text-classification", model="distilbert-base-uncased-finetuned-financial")
INFERENCE¶
Let's utilise our fine-tuned transformer model for inference on the validation dataset
# Predict on Validation Dataset
pred_output = trainer.predict(corpus_tokenised["validation"])
pred_output
PredictionOutput(predictions=array([[ 0.4898075 , -1.1539025 , -2.4000468 , ..., 2.664791 , -1.7696137 , -1.85924 ], [ 5.524978 , 1.4433606 , -1.2144918 , ..., -2.0602236 , -2.2785227 , -3.1501598 ], [-0.99464035, 1.410322 , 5.1908035 , ..., -2.1883903 , -1.5084958 , -3.9436107 ], ..., [ 0.8029568 , 2.751484 , -1.0246688 , ..., -2.6813054 , 0.32632264, -3.8141603 ], [ 0.27522528, -1.2025931 , -0.14139701, ..., -2.684458 , -1.0188793 , -3.16859 ], [-0.17902231, -1.83018 , -0.8901656 , ..., -3.2526188 , 0.6185259 , -3.5121055 ]], dtype=float32), label_ids=array([4, 1, 2, ..., 0, 5, 5]), metrics={'test_loss': 0.47774738073349, 'test_accuracy': 0.8669424336435712, 'test_f1': 0.8642125632561599, 'test_runtime': 26.9558, 'test_samples_per_second': 107.621, 'test_steps_per_second': 6.752})
print(f'Output Predition: {pred_output.predictions.shape}')
print(pred_output.predictions)
Output Predition: (2901, 11) [[ 0.4898075 -1.1539025 -2.4000468 ... 2.664791 -1.7696137 -1.85924 ] [ 5.524978 1.4433606 -1.2144918 ... -2.0602236 -2.2785227 -3.1501598 ] [-0.99464035 1.410322 5.1908035 ... -2.1883903 -1.5084958 -3.9436107 ] ... [ 0.8029568 2.751484 -1.0246688 ... -2.6813054 0.32632264 -3.8141603 ] [ 0.27522528 -1.2025931 -0.14139701 ... -2.684458 -1.0188793 -3.16859 ] [-0.17902231 -1.83018 -0.8901656 ... -3.2526188 0.6185259 -3.5121055 ]]
import numpy as np
# Decode the predictions greedily using argmax (highest value of all classes)
y_preds = np.argmax(pred_output.predictions,axis=1)
print(f'Output Prediction:{y_preds.shape}')
print(f'Predictions: {y_preds}')
Output Prediction:(2901,) Predictions: [4 0 2 ... 1 5 5]
# Validation
plot_confusion_matrix(y_preds,y_valid,labels)
- Fine-tuning a pretrained-transformed works sufficiently better than our baseline approach
- virtual currency is also not predicted very well (same as the linear model)
- payday loan is predicted well but other financial services poorly