Financial consumer complaint analysis

In [1]:

Copied!

pip install -U kaleido
pip install -U kaleido

Collecting kaleido
  Downloading kaleido-0.2.1-py2.py3-none-manylinux1_x86_64.whl (79.9 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 79.9/79.9 MB 13.6 MB/s eta 0:00:00
Installing collected packages: kaleido
Successfully installed kaleido-0.2.1
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Note: you may need to restart the kernel to use updated packages.

1 | BACKGROUND
¶

In this section we will outline what customer feedback is, why it is an important part of any business, not only for financial companies, but in general. Show some examples, which should show that it can take some time to manually read and analyse what each consumer complaint is about. How can we utilise consumer feedback to streamline the process of consumer-company interaction (the need)

CONSUMER FEEDBACK
¶

Let's point our some key points about consumer feedback:

Consumer feedback is an important part of day to day financial business operations Companies offering products must be able to know what their consumers think of their products Eg. positive & negative feedback & can be obtained from a number of sources (eg. twitter) In this case, we obtain data from a database, which registers consumers feedback of financial products Customers have specific issue on a number of topics they want want the company to address The form of consumer communication with the financial institution is via the web (as will be shown later)

USEFULNESS OF CONSUMER FEEDBACK

Can your customers tell you something important? | Source

If you run your own business, I know you do your best to please your customers
Satisfy their needs, and keep them loyal to your brand.
But how can you be sure that your efforts bring desired results?
If you do not try to find out what your clients think about your service
You will never be able to give them the best customer experience.
Their opinions about their experience with your brand are helpful information
That you can use to adjust your business to fit their needs more accurately

The source clearly outlines that consumer feedback is quite critical for any business
Consumer feedback in our problem is related to a consumer having an issue with a particualr financial product or alike

CONSUMER FEEDBACK EXAMPLES

Let's look at a couple of examples of a consumer's addresses to a company:

Product: Credit reporting | Issue: Incorrect information on credit report

After looking at my credit report I saw a collection account that does not belong to me. I am not allowed to dispute this information online on Experian or over the phone making it impossible for me. This false information is ruining my credit and knowing full well this people did not do their job and allow people to just post false accounts on my report. They need to delete this information immediately and do a proper investigation as this information is not mine. '

Product: Credit card | Issue: Credit line increase/decrease

"XXXX i receive an email from citibank regarding my XXXX credit card. It was an offer to request a credit increase and it clearly stated that there would be NO Credit bureau inquiry made. I clicked on the link in the email and entered the requested information. a couple of days later I received an alert from my credit bureau monitoring service that a hard inquiry was done. Upon looking at the report it showed Citibank credit cards making a hard credit inquiry which was completely opposite of what their email stated. I called citi and they confimred that the email stated there would be no creidt inquiry done however they said that the request was made on a different citibank credit card which is why the hard inquiry was made. I explained to the rep I clicked on the link they provided and if was for a different account of mine it was not my issue but theirs and they need to remove the inquiry. They told me to send a letter to their credit dispute department explaining it. I sent the letter after waiting more than a month I received a blunt statement stating the it was a valid credit request and they will not remove the inquiry from my credit bureau. Citi performed bait and switch by offering a no inquiry credit request and then doing a hard inquiry which has negatively affected my credit score. I asked to remove it and received a generic letter stating they would not with no number to contact the department that sent the letter when i called the main customer service number they said that department dosent talk to customers and there was nothing else they can do. This has negatively affected my credit score and will remain on my credit report for 2 years because citi 's False advertising. and then their lack of fixing their error "

After reading this long complaint:

It should become apparent that manual evaluation of each consumer issue can can a while to process and is very inefficient
For a timely & helpful consumer response, the relevant problem not only must be processed in a timely manner, but passed on to a specific expert that has experience dealing with the particular issue

2 | NOTEBOOK WORKFLOW
¶

STUDY AIM¶

In this notebook, we'll be utilising machine learning, to create a model(s) that will be able to classify the type of complaint (as we did above) (by both product & issue)
Such a model can be useful for a company to quickly understand the type of complaint & appoint a financial expert that will be able to solve the problem

AUTOMATED TICKET CLASSIFICATION MODEL¶

Our approach will include separate models, that will be in charge of classifying data on different subsets of data

M1 will be classifying a product based on the customer's input complaint (text) (eg. Credit Reporting)
M2 will be in charge of classifying the particular issue to which the complaint belongs to (text)

3 | DATA PREPROCESSING
¶

In this section, we will dive in to the dataset, making some slight adjustments; loading the data, looking at missing data, looking at the features & make some slight adustments

CONSUMER COMPLAINT DATABASE¶

Download full Dataset from the provided link if you want to have the up to date data
Complaints that the Consumer Financial Protection Bureau (CFPB) sends to companies for response are published in the Consumer Complaint Database after the company responds
Confirming a commercial relationship with the consumer, or after 15 days, whichever comes first

LOAD DATASET¶

We start off by loading the dataset (we are loading the dataset without any missing data in consumer complaint narrative (which is the complaint)

In [2]:

Copied!





%%time
import pandas as pd
import plotly.express as px
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('/kaggle/input/complaintsfull/main.csv',low_memory=False)
df = df.drop(['Unnamed: 0'],axis=1)
%%time
import pandas as pd
import plotly.express as px
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('/kaggle/input/complaintsfull/main.csv',low_memory=False)
df = df.drop(['Unnamed: 0'],axis=1)

CPU times: user 22.3 s, sys: 3.15 s, total: 25.5 s
Wall time: 36.9 s

TARGET LABELS¶

A quick glimps into the dataset gives us the view of the features that will be of interest to us in this study

Product (Type of financial product)
Sub-product (A more detailed subset of product)
Issue (What was the problem)
Sub-issue (A more detailed subset of product)

MISSING DATA¶

Visualise missing data in the dataset, looks like we have quite a bit overall & some in target variables (Sub-Product & Sub-Issue)

In [3]:

Copied!

import missingno as ms
ms.matrix(df)
import missingno as ms
ms.matrix(df)

Out[3]:

<AxesSubplot:>

No description has been provided for this image

We have quite a bit of missing data, we have already removed rows, which have missing data in our text target (consumer complaint narrative)
And quite a heavy dataset, let's utilise only the relevant data for our problem (by the end of this section) which should reduce the number of rows in our data significatntly

FEATURE DESCRIPTION¶

Brief summary of what each feature represents in our dataset

In [4]:

Copied!

print('Dataset Features')
df.columns
print('Dataset Features')
df.columns

Dataset Features

Out[4]:

Index(['Date received', 'Product', 'Sub-product', 'Issue', 'Sub-issue',
       'Consumer complaint narrative', 'Company public response', 'Company',
       'State', 'ZIP code', 'Tags', 'Consumer consent provided?',
       'Submitted via', 'Date sent to company', 'Company response to consumer',
       'Timely response?', 'Consumer disputed?', 'Complaint ID', 'Year',
       'Month', 'Day', 'DoW'],
      dtype='object')

Date received - When the complaint was addressed
Product - Complaint Type
Issue - Brief summary of the issue
Consumer complaint narrative - What the customer wrote (documents)
Company public response - How did the company respond
State - State in which the complaint was made
Submitted - Form of complaint
Customer disputed? - Did the customer dispute the response

ADDING DATETIME FEATURES¶

Lets add time-series based features, normalise column and column subset names & remove some column subset data for our target variable

We have two timeline features, Date received & Date sent to company
Lets extract additional features which can be useful for EDA

In [5]:

Copied!





def object_to_datetime_features(df,column):

    df[column] = df[column].astype('datetime64[ns]')
    df['Year'] = df[column].dt.year
    df['Month'] = df[column].dt.month
    df['Day'] = df[column].dt.day
    df['DoW'] = df[column].dt.dayofweek
    df['DoW'] = df['DoW'].replace({0:'Monday',1:'Tuesday',2:'Wednesday',
                                   3:'Thursday',4:'Friday',5:'Saturday',6:'Sunday'})
    return df

df = object_to_datetime_features(df,'Date received')
df.columns
def object_to_datetime_features(df,column):

    df[column] = df[column].astype('datetime64[ns]')
    df['Year'] = df[column].dt.year
    df['Month'] = df[column].dt.month
    df['Day'] = df[column].dt.day
    df['DoW'] = df[column].dt.dayofweek
    df['DoW'] = df['DoW'].replace({0:'Monday',1:'Tuesday',2:'Wednesday',
                                   3:'Thursday',4:'Friday',5:'Saturday',6:'Sunday'})
    return df

df = object_to_datetime_features(df,'Date received')
df.columns

Out[5]:

Index(['Date received', 'Product', 'Sub-product', 'Issue', 'Sub-issue',
       'Consumer complaint narrative', 'Company public response', 'Company',
       'State', 'ZIP code', 'Tags', 'Consumer consent provided?',
       'Submitted via', 'Date sent to company', 'Company response to consumer',
       'Timely response?', 'Consumer disputed?', 'Complaint ID', 'Year',
       'Month', 'Day', 'DoW'],
      dtype='object')

COLUMN NAME NORMALISATION¶

Lets convert all column names to a lower register

In [6]:

Copied!

# lower the register of columns

def normalise_column_names(df):
    
    normalised_features = [i.lower() for i in list(df.columns)]
    df.columns = normalised_features
    return df

df = normalise_column_names(df)
# lower the register of columns

def normalise_column_names(df):
    
    normalised_features = [i.lower() for i in list(df.columns)]
    df.columns = normalised_features
    return df

df = normalise_column_names(df)

NORMALISATION OF SUBSET NAMES¶

Lets convert all subset feature names into a lower register as well

In [7]:

Copied!

# show the names of each subset

def show_subset_names(df,column):    
    return df[column].value_counts().index

show_subset_names(df,'product')
# show the names of each subset

def show_subset_names(df,column):    
    return df[column].value_counts().index

show_subset_names(df,'product')

Out[7]:

Index(['Credit reporting, credit repair services, or other personal consumer reports',
       'Debt collection', 'Mortgage', 'Credit card or prepaid card',
       'Checking or savings account', 'Student loan', 'Credit reporting',
       'Money transfer, virtual currency, or money service',
       'Vehicle loan or lease', 'Credit card', 'Bank account or service',
       'Payday loan, title loan, or personal loan', 'Consumer Loan',
       'Payday loan', 'Money transfers', 'Prepaid card',
       'Other financial service', 'Virtual currency'],
      dtype='object')

In [8]:

Copied!





def normalise_subset_names(df,column):
    subset_names = list(df[column].value_counts().index)
    norm_subset_names = [i.lower() for i in subset_names]
    dict_replace = dict(zip(subset_names,norm_subset_names))
    df[column] = df[column].replace(dict_replace)    
    return df

df = normalise_subset_names(df,'product')
show_subset_names(df,'product')
def normalise_subset_names(df,column):
    subset_names = list(df[column].value_counts().index)
    norm_subset_names = [i.lower() for i in subset_names]
    dict_replace = dict(zip(subset_names,norm_subset_names))
    df[column] = df[column].replace(dict_replace)    
    return df

df = normalise_subset_names(df,'product')
show_subset_names(df,'product')

Out[8]:

Index(['credit reporting, credit repair services, or other personal consumer reports',
       'debt collection', 'mortgage', 'credit card or prepaid card',
       'checking or savings account', 'student loan', 'credit reporting',
       'money transfer, virtual currency, or money service',
       'vehicle loan or lease', 'credit card', 'bank account or service',
       'payday loan, title loan, or personal loan', 'consumer loan',
       'payday loan', 'money transfers', 'prepaid card',
       'other financial service', 'virtual currency'],
      dtype='object')

FILTER SUBSET DATA¶

Let's keep only specific subsets of data in the product column

In [9]:

Copied!





# keep only specific subset in a feature

def keep_subset(df,column,lst):
    
    all_features = list(df[column].value_counts().index)
    keep_features = lst
    
    # subset data
    subset_data = dict(tuple(df.groupby(column)))
    subset_data_filter = lambda x, y: dict([ (i,x[i]) for i in x if i in set(y) ])
    
    # dictionary with only selected keys
    filtered_data = subset_data_filter(subset_data,lst)
    filtered_df = pd.concat(filtered_data.values())
    filtered_df.reset_index(drop=True,inplace=True)
    return filtered_df
    
# remove specific subset from feature

def remove_subset(df,column,lst):
    
    all_features = list(df[column].value_counts().index)
    keep_features = lst
    
    # subset data
    subset_data = dict(tuple(df.groupby(column)))
    set_all_features = set(all_features)
    set_keep_features = set(lst)
    
    # features of dictionary which should remain
    remaining_features = set_all_features - set_keep_features
 
    subset_data_filter = lambda x, y: dict([ (i,x[i]) for i in x if i in set(y) ])
    filtered_data = subset_data_filter(subset_data,remaining_features)
    filtered_df = pd.concat(filtered_data.values())
    filtered_df.reset_index(drop=True,inplace=True)
    return filtered_df
# keep only specific subset in a feature

def keep_subset(df,column,lst):
    
    all_features = list(df[column].value_counts().index)
    keep_features = lst
    
    # subset data
    subset_data = dict(tuple(df.groupby(column)))
    subset_data_filter = lambda x, y: dict([ (i,x[i]) for i in x if i in set(y) ])
    
    # dictionary with only selected keys
    filtered_data = subset_data_filter(subset_data,lst)
    filtered_df = pd.concat(filtered_data.values())
    filtered_df.reset_index(drop=True,inplace=True)
    return filtered_df
    
# remove specific subset from feature

def remove_subset(df,column,lst):
    
    all_features = list(df[column].value_counts().index)
    keep_features = lst
    
    # subset data
    subset_data = dict(tuple(df.groupby(column)))
    set_all_features = set(all_features)
    set_keep_features = set(lst)
    
    # features of dictionary which should remain
    remaining_features = set_all_features - set_keep_features
 
    subset_data_filter = lambda x, y: dict([ (i,x[i]) for i in x if i in set(y) ])
    filtered_data = subset_data_filter(subset_data,remaining_features)
    filtered_df = pd.concat(filtered_data.values())
    filtered_df.reset_index(drop=True,inplace=True)
    return filtered_df

In [10]:

Copied!





lst_keep = ['credit reporting', 'debt collection', 'mortgage', 'credit card',
            'bank account or service', 'consumer Loan', 'student loan',
            'payday loan', 'prepaid card', 'money transfers',
            'other financial service', 'virtual currency']

df = keep_subset(df,'product',lst_keep)
# sdf = remove_subset(df,'product',lst_remove)
df['product'].value_counts()
lst_keep = ['credit reporting', 'debt collection', 'mortgage', 'credit card',
            'bank account or service', 'consumer Loan', 'student loan',
            'payday loan', 'prepaid card', 'money transfers',
            'other financial service', 'virtual currency']

df = keep_subset(df,'product',lst_keep)
# sdf = remove_subset(df,'product',lst_remove)
df['product'].value_counts()

Out[10]:

debt collection            195373
mortgage                    99141
student loan                33606
credit reporting            31587
credit card                 18838
bank account or service     14885
payday loan                  1746
money transfers              1497
prepaid card                 1450
other financial service       292
virtual currency               16
Name: product, dtype: int64

SELECT SUBSET OF DATA¶

Let's choose only a specific subset of time series data because to reduce the computational load
We can note that data beyond 2017 has only three of the available subsets

In [11]:

Copied!

df['year'].value_counts()
df['year'].value_counts()

Out[11]:

2016    73146
2017    59087
2015    51779
2021    50757
2022    42867
2018    41708
2020    40801
2019    38286
Name: year, dtype: int64

In [12]:

Copied!

ldf = df.groupby('product').count()['day'].to_frame().sort_values(ascending=False,by='day')

ldf.style\
    .bar(align='mid',
         color=['#d65f5f','#F1A424'])
ldf = df.groupby('product').count()['day'].to_frame().sort_values(ascending=False,by='day')

ldf.style\
    .bar(align='mid',
         color=['#d65f5f','#F1A424'])

Out[12]:

	day
product
debt collection	195373
mortgage	99141
student loan	33606
credit reporting	31587
credit card	18838
bank account or service	14885
payday loan	1746
money transfers	1497
prepaid card	1450
other financial service	292
virtual currency	16

We still have 184 thousand samples to go by, which should still be sufficient for our task

In [13]:

Copied!

df = df[df['year'].isin([2015,2016,2017])]
print(f'final shape: {df.shape}')
df = df[df['year'].isin([2015,2016,2017])]
print(f'final shape: {df.shape}')

final shape: (184012, 22)

4 | TARGET DISTRIBUTION
¶

The target variable in our problem:

Product Value Distribution (target variable for M1)

PRODUCT¶

Let's plot the grouped subset distribution for feature Product

In [14]:

Copied!





fig = px.bar((df['product']
              .value_counts(ascending=False)
              .to_frame()),
             x='product',
             template='plotly_white',
             title='Product Subset Distribution')

fig.update_traces(marker_line_color='#F1A424',
                  marker_line_width=0.1,
                  marker={'color':'#F1A424'},width=0.5)

fig.show("png")
fig = px.bar((df['product']
              .value_counts(ascending=False)
              .to_frame()),
             x='product',
             template='plotly_white',
             title='Product Subset Distribution')

fig.update_traces(marker_line_color='#F1A424',
                  marker_line_width=0.1,
                  marker={'color':'#F1A424'},width=0.5)

fig.show("png")

We can note that we have some target class inbalance, in both product and issue features,
Lets stick stratification, we need to make sure each class is represented in both datasets

5 | EXPLORATORY DATA ANALYSIS
¶

The dataset contains a lot of categorical which can be grouped & analysed:

Consumer Complaint Method (How was the complaint made?)
Company to which the consumer complained (To which company was the complaint made)
Consumer Complaint Timeline (When were the complaints made?)
Consumer Complaint Timeline Group Trends (grouping data; are there any trends in the timeline)
Consumer Address State (In which state was the complaint made?)
Company Response to Consumer Complaints (What was the response to the complaint?)
Consumer Response to company response (Did the consumer dispuse the response?)

CONSUMER COMPLAINT METHOD¶

All customers addresses for which we have text data are submitted via the web
This implies that all other forms of text data of complaints were not registered in the data for other forms of complaint submission

In [16]:

Copied!

df['submitted via'].value_counts(ascending=True)
df['submitted via'].value_counts(ascending=True)

Out[16]:

Web    184012
Name: submitted via, dtype: int64

FINANCIAL COMPANY¶

Let's visualise the top 10 companies for which we have consumer complaint feedback

In [17]:

Copied!





# plot subset value counts 

def plot_subset_counts(df,column,orient='h',top=None):
    
    ldf = df[column].value_counts(ascending=False).to_frame()
    ldf.columns = ['values']
    
    if(top):
        ldf = ldf[:top]
    
    if(orient is 'h'):
        fig = px.bar(data_frame=ldf,
                     x = ldf.index,
                     y = 'values',
                     template='plotly_white',
                     title='Subset Value-Counts')

    elif('v'):

        fig = px.bar(data_frame=ldf,
                             y = ldf.index,
                             x = 'values',
                             template='plotly_white',
                             title='Subset Value-Counts')
        
    fig.update_layout(height=400)
    fig.update_traces(marker_line_color='white',
                      marker_line_width=0.5,
                      marker={'color':'#F1A424'},
                      width=0.75)
    
    fig.show("png")
# plot subset value counts 

def plot_subset_counts(df,column,orient='h',top=None):
    
    ldf = df[column].value_counts(ascending=False).to_frame()
    ldf.columns = ['values']
    
    if(top):
        ldf = ldf[:top]
    
    if(orient is 'h'):
        fig = px.bar(data_frame=ldf,
                     x = ldf.index,
                     y = 'values',
                     template='plotly_white',
                     title='Subset Value-Counts')

    elif('v'):

        fig = px.bar(data_frame=ldf,
                             y = ldf.index,
                             x = 'values',
                             template='plotly_white',
                             title='Subset Value-Counts')
        
    fig.update_layout(height=400)
    fig.update_traces(marker_line_color='white',
                      marker_line_width=0.5,
                      marker={'color':'#F1A424'},
                      width=0.75)
    
    fig.show("png")

In [18]:

Copied!

plot_subset_counts(df,'company',orient='v',top=10)
plot_subset_counts(df,'company',orient='v',top=10)

COMPANY RESPONSE¶

Let's look at the data about what the company decided to do about the registered complaint

In [19]:

Copied!

ldf = df['company public response'].value_counts(ascending=False).to_frame()

ldf.style\
    .bar(align='mid',
         color=['#3b3745','#F1A424'])
ldf = df['company public response'].value_counts(ascending=False).to_frame()

ldf.style\
    .bar(align='mid',
         color=['#3b3745','#F1A424'])

Out[19]:

	company public response
Company has responded to the consumer and the CFPB and chooses not to provide a public response	41598
Company chooses not to provide a public response	18935
Company believes it acted appropriately as authorized by contract or law	17907
Company believes the complaint is the result of a misunderstanding	1839
Company disputes the facts presented in the complaint	1792
Company believes complaint caused principally by actions of third party outside the control or direction of the company	1436
Company believes complaint is the result of an isolated error	1345
Company believes complaint represents an opportunity for improvement to better serve consumers	813
Company can't verify or dispute the facts in the complaint	768
Company believes complaint relates to a discontinued policy or procedure	18

Most of the consumer complaints were addressed in private as opposed to public

COMPLAINT TIMELINE¶

Let's look at the weekly (using resample) complaint addresses (let's see if there are some trends in the time series data)

In [20]:

Copied!





# Daily Complaints
complaints = df.copy()
complaints_daily = complaints.groupby(['date received']).agg("count")[["product"]] # daily addresses

# Sample weekly
complaints_weekly = complaints_daily.reset_index()
complaints_weekly = complaints_weekly.resample('W', on='date received').sum() # weekly addresses

fig = px.line(complaints_weekly,complaints_weekly.index,y="product",
              template="plotly_white",title="Weekly Complaints",height=400)
fig.update_traces(line_color='#F1A424')
fig.show("png")
# Daily Complaints
complaints = df.copy()
complaints_daily = complaints.groupby(['date received']).agg("count")[["product"]] # daily addresses

# Sample weekly
complaints_weekly = complaints_daily.reset_index()
complaints_weekly = complaints_weekly.resample('W', on='date received').sum() # weekly addresses

fig = px.line(complaints_weekly,complaints_weekly.index,y="product",
              template="plotly_white",title="Weekly Complaints",height=400)
fig.update_traces(line_color='#F1A424')
fig.show("png")

We can observe an increasing trend in complaints registed (this could be because users simply regitered the complaints more)
After April 2nd, 2017, there is a rapid decline in registered complaints
Some interesting peaks with an unusually high number of registed complaints occured in 2017 (January,September)
We can note that the pandemic had also an effect on the number of complaints registed

CONSUMER COMPLAINT TIMELINE TRENDS¶

Let's group all complaint data into groups for Day of the month (DoM), time of the year (ToY) & day of the week (DoW)

In [21]:

Copied!





fig = px.bar(df['day'].value_counts(ascending=True).to_frame(),y='day',
             template='plotly_white',height=300,
             title='Day of the Month Complaint Trends')
fig.update_xaxes(tickvals = [i for i in range(0,32,1)])
fig.update_traces(marker_line_color='#F1A424',marker_line_width=0.1,
                  marker={'color':'#F1A424'},width=0.5)
fig.update_traces(textfont_size=12, textangle=0, 
                  textposition="outside", cliponaxis=False)
fig.show("png")
fig = px.bar(df['day'].value_counts(ascending=True).to_frame(),y='day',
             template='plotly_white',height=300,
             title='Day of the Month Complaint Trends')
fig.update_xaxes(tickvals = [i for i in range(0,32,1)])
fig.update_traces(marker_line_color='#F1A424',marker_line_width=0.1,
                  marker={'color':'#F1A424'},width=0.5)
fig.update_traces(textfont_size=12, textangle=0, 
                  textposition="outside", cliponaxis=False)
fig.show("png")

In [22]:

Copied!





fig = px.bar(df['month'].value_counts(ascending=False).to_frame(),y='month',
             template='plotly_white',height=300,
             title='Month of the Year Complaint Trends')
fig.update_xaxes(tickvals = [i for i in range(0,13,1)])
fig.update_traces(marker_line_color='#F1A424',marker_line_width=0.1,
                  marker={'color':'#F1A424'},width=0.5)
fig.show("png")
fig = px.bar(df['month'].value_counts(ascending=False).to_frame(),y='month',
             template='plotly_white',height=300,
             title='Month of the Year Complaint Trends')
fig.update_xaxes(tickvals = [i for i in range(0,13,1)])
fig.update_traces(marker_line_color='#F1A424',marker_line_width=0.1,
                  marker={'color':'#F1A424'},width=0.5)
fig.show("png")

In [23]:

Copied!





# By DoW
fig = px.bar(df['dow'].value_counts(ascending=False).to_frame(),y='dow',
             template='plotly_white',height=300,
             title='Day of the Week Complaint Trends')
fig.update_traces(marker_line_color='#F1A424',marker_line_width=0.1,
                  marker={'color':'#F1A424'},width=0.5)
fig.update_traces(textfont_size=12, textangle=0, textposition="outside", cliponaxis=False)
fig.show("png")
# By DoW
fig = px.bar(df['dow'].value_counts(ascending=False).to_frame(),y='dow',
             template='plotly_white',height=300,
             title='Day of the Week Complaint Trends')
fig.update_traces(marker_line_color='#F1A424',marker_line_width=0.1,
                  marker={'color':'#F1A424'},width=0.5)
fig.update_traces(textfont_size=12, textangle=0, textposition="outside", cliponaxis=False)
fig.show("png")

Day of the month seems to be quite a cyclic trend, with lower numbers closer to the weekends
July, August & September are associated with increased complaints, December, January & February associated with lower number of complaints
Novermber, December, January & February don't have adequate data
Tuesdays & Wednesdays tend to be the most common day a consumer will write a complaint
Consumers don't tend to write complaints on weekends (Saturday,Sunday)

COMPLAINT ORIGIN (STATE) ¶

Let's investigate the distribution from which geographical state the complaint was made from

In [24]:

Copied!





ldf = df['state'].value_counts(ascending=True).to_frame()[50:]

fig = px.bar(ldf,x='state',template='plotly_white',
       title='State of Complaint',height=400)

fig.update_traces(marker_line_color='#F1A424',marker_line_width=1,
                  marker={'color':'#F1A424'},width=0.4)
fig.show("png")
ldf = df['state'].value_counts(ascending=True).to_frame()[50:]

fig = px.bar(ldf,x='state',template='plotly_white',
       title='State of Complaint',height=400)

fig.update_traces(marker_line_color='#F1A424',marker_line_width=1,
                  marker={'color':'#F1A424'},width=0.4)
fig.show("png")

Most of the consumer complaints are from California, Florida, Texas, Georgia and New York

COMPARING RESPONSE TO DISPUTES¶

Lastly, we can combine the last two sections and see for each Product, how many disputes there have been each month

In [25]:

Copied!





disputed = df[['product','consumer disputed?','month']]

fig = px.histogram(disputed, y='product', 
                   color='consumer disputed?',
                   template='plotly_white',
                   height = 700,
                   barmode='group',
                   color_discrete_sequence=['#F1A424','#3b3745'],
                   facet_col_wrap=3,
                   facet_col='month')

fig.update_layout(showlegend=False)
fig.update_layout(barmode="overlay")
fig.update_traces(opacity=0.5)
fig.show("png")
disputed = df[['product','consumer disputed?','month']]

fig = px.histogram(disputed, y='product', 
                   color='consumer disputed?',
                   template='plotly_white',
                   height = 700,
                   barmode='group',
                   color_discrete_sequence=['#F1A424','#3b3745'],
                   facet_col_wrap=3,
                   facet_col='month')

fig.update_layout(showlegend=False)
fig.update_layout(barmode="overlay")
fig.update_traces(opacity=0.5)
fig.show("png")

Morgage & Debt Collection tend to be quite often disputed all year round
September & October have the highest disputed cases for Morgage & Debt Collection
July & August have an unusually high ammount of credit reporting complaints

6 | PREPARING DATA FOR MODELING
¶

We've done some exploratory data analysis & understand our problem target variables a little better, let's focus on preparing the data for machine learning
In total, we have 1159430 complaints, although not evenly distributed as we saw in Section 3
Our smallest class (virtual currency) has only 16 issues (which is very little data)

In [26]:

Copied!

df['product'].value_counts(ascending=False).to_frame().sum()
df['product'].value_counts(ascending=False).to_frame().sum()

Out[26]:

product    184012
dtype: int64

In [27]:

Copied!

df['product'].value_counts(ascending=False).to_frame().tail(3)
df['product'].value_counts(ascending=False).to_frame().tail(3)

Out[27]:

	product
prepaid card	1450
other financial service	292
virtual currency	16

REVIEW CLASS DISBALANCE¶

There doesn't seem to be any error associated with labelling, so let's not remove this subset

Circle is a Boston-based financial services company that uses blockchain technology for its peer-to-peer payments and cryptocurrency-related products.

Despite wanting to utilise Stratification, it seems like we may not have enough data for the model to be able to classify complaints with little data available

In [28]:

Copied!

print('Sample from virtual currency:')
vc = dict(tuple(df.groupby(by='product')))['virtual currency']
vc.iloc[[0]]['consumer complaint narrative'].values[0]
print('Sample from virtual currency:')
vc = dict(tuple(df.groupby(by='product')))['virtual currency']
vc.iloc[[0]]['consumer complaint narrative'].values[0]

Sample from virtual currency:

Out[28]:

'Signedup XXXX family members for referrals on Coinbase.com. Coinbase at that time offered {$75.00} for each person referred to their service. I referred all XXXX and they met the terms Coinbase intially offered. Signup took a while do to money transfer timeframes setup by Coinbase. In that time, Coinbase changed their promotion and terms to {$10.00} for referrals. When asked why, they said they could change terms at anytime ( even if signup up for {$75.00} referral bonus ) and that family members did not meet the terms either. Felt like they just change terms to disclude giving out referral bonuses.'

PRODUCT SUBSET AMBIGUITY¶

We have a feature Credit reporting, credit repair services, or other personal consumer reports, it seems like this feature is not quite sorted
We already have a separate subgroup for Credit reporting, but not for cedit repair services or other consumer reports
There is the possibility that this subgroup will contain complaints of other subgroups, which would affect the model accuracy
Lets remove the subset: Credit reporting, credit repair services, or other personal consumer reports for the time being

Some of the approaches we could take are:

Try to sort the data by keywords

In [29]:

Copied!





# When we have a feature that contains subsets, we can remove unwanted subsets
        
def remove_subset(df,feature,lst_groups):
    
    ndf = df.copy()

    # Let's down sample all classes with frequencies above 4k
    group = dict(tuple(ndf.groupby(by=feature)))
    subset_group = list(group.keys()) # subsets in feature
    
    # Check if features exist in columns
    if(set(lst_groups).issubset(subset_group)):

        # remove unwanted subset
        for k in lst_groups:
            group.pop(k, None)

        df = pd.concat(list(group.values()))
        df.reset_index(inplace=True,drop=True)

        return df    
# When we have a feature that contains subsets, we can remove unwanted subsets
        
def remove_subset(df,feature,lst_groups):
    
    ndf = df.copy()

    # Let's down sample all classes with frequencies above 4k
    group = dict(tuple(ndf.groupby(by=feature)))
    subset_group = list(group.keys()) # subsets in feature
    
    # Check if features exist in columns
    if(set(lst_groups).issubset(subset_group)):

        # remove unwanted subset
        for k in lst_groups:
            group.pop(k, None)

        df = pd.concat(list(group.values()))
        df.reset_index(inplace=True,drop=True)

        return df    

We are down to 611,233 complaints (about have of what we had)
Let's also confirm that our function remove_subset works correctly

LIMIT TARGET CLASS SUBSET SAMPLES¶

We saw that we have quite a bit of data available

Most of which are in particular classes (eg. credit collection ,debt reporting & mortgage)
Let's limit our data to 4000 text samples from each class using function downsample_features

In [30]:

Copied!





# Downsample selected 

def downsample_subset(df,feature,lst_groups,samples=4000):

    ndf = df.copy()

    # Let's down sample all classes with frequencies above 4k
    group = dict(tuple(ndf.groupby(by=feature)))
    subset_group = list(group.keys())

    # Check if features exist in columns
    if(set(lst_groups).issubset(subset_group)):

        dict_downsamples = {}
        for feature in lst_groups:
            dict_downsamples[feature] = group[feature].sample(samples)

        # remove old data
        for k in lst_groups:
            group.pop(k, None)

        # read them back
        group.update(dict_downsamples)

        df = pd.concat(list(group.values()))
        df.reset_index(inplace=True,drop=True)

        return df
    
    else:
        print('feature not found in dataframe')
        
# Downsample selected 

def downsample_subset(df,feature,lst_groups,samples=4000):

    ndf = df.copy()

    # Let's down sample all classes with frequencies above 4k
    group = dict(tuple(ndf.groupby(by=feature)))
    subset_group = list(group.keys())

    # Check if features exist in columns
    if(set(lst_groups).issubset(subset_group)):

        dict_downsamples = {}
        for feature in lst_groups:
            dict_downsamples[feature] = group[feature].sample(samples)

        # remove old data
        for k in lst_groups:
            group.pop(k, None)

        # read them back
        group.update(dict_downsamples)

        df = pd.concat(list(group.values()))
        df.reset_index(inplace=True,drop=True)

        return df
    
    else:
        print('feature not found in dataframe')
        

In [31]:

Copied!

# Select subset features which have more than 4000 samples
subset_list = list(df['product'].value_counts()[df['product'].value_counts().values > 4000].index)
df = downsample_subset(df,'product',subset_list,samples=4000)
# Select subset features which have more than 4000 samples
subset_list = list(df['product'].value_counts()[df['product'].value_counts().values > 4000].index)
df = downsample_subset(df,'product',subset_list,samples=4000)

Let's check if our function downsample_subset works correctly

In [32]:

Copied!

df['product'].value_counts()
df['product'].value_counts()

Out[32]:

debt collection            4000
mortgage                   4000
credit reporting           4000
credit card                4000
student loan               4000
bank account or service    4000
payday loan                1746
money transfers            1497
prepaid card               1450
other financial service     292
virtual currency             16
Name: product, dtype: int64

TRAIN-TEST SUBSET SPLITTING¶

Next, as per standard requirements, we need to be able to validate the model after training
We will split the data into two groups, training & validation subsets with a validation size of 0.2
Stratification will also be applied to both groups, in order to guarantee all classes in both subgroups

In [33]:

Copied!





# Select only relevant data
df_data = df[['consumer complaint narrative','product']]
df_data.columns = ['text','label']
df_data.head()
# Select only relevant data
df_data = df[['consumer complaint narrative','product']]
df_data.columns = ['text','label']
df_data.head()

Out[33]:

	text	label
0	I made a wire transfer through Citibank to XXX...	money transfers
1	I purchased a money order on XX/XX/2016 ( to c...	money transfers
2	I have complained of false online transfer num...	money transfers
3	I paid by bank wire transfer on XXXX/XXXX/XXXX...	money transfers
4	I found a XXXX Bulldog for sale on XXXX after ...	money transfers

In [34]:

Copied!





from sklearn.model_selection import train_test_split as tts

train_files,test_files, train_labels, test_labels = tts(df_data['text'],
                                                        df_data['label'],
                                                        test_size=0.1,
                                                        random_state=32,
                                                        stratify=df_data['label'])

train_files = pd.DataFrame(train_files)
test_files = pd.DataFrame(test_files)
train_files['label'] = train_labels
test_files['label'] = test_labels
from sklearn.model_selection import train_test_split as tts

train_files,test_files, train_labels, test_labels = tts(df_data['text'],
                                                        df_data['label'],
                                                        test_size=0.1,
                                                        random_state=32,
                                                        stratify=df_data['label'])

train_files = pd.DataFrame(train_files)
test_files = pd.DataFrame(test_files)
train_files['label'] = train_labels
test_files['label'] = test_labels

In [35]:

Copied!

print(type(train_files))
print('Training Data',train_files.shape)
print('Validation Data',test_files.shape)
print(type(train_files))
print('Training Data',train_files.shape)
print('Validation Data',test_files.shape)

<class 'pandas.core.frame.DataFrame'>
Training Data (26100, 2)
Validation Data (2901, 2)

In [36]:

Copied!





import plotly.express as px

train_values = train_files['label'].value_counts()
test_values = test_files['label'].value_counts()
visual = pd.concat([train_values,test_values],axis=1)
visual = visual.T
visual.index = ['train','test']

fig = px.bar(visual,template='plotly_white',
       barmode='group',text_auto=True,height=300,
       title='Train/Test Split Distribution')

fig.show("png")
import plotly.express as px

train_values = train_files['label'].value_counts()
test_values = test_files['label'].value_counts()
visual = pd.concat([train_values,test_values],axis=1)
visual = visual.T
visual.index = ['train','test']

fig = px.bar(visual,template='plotly_white',
       barmode='group',text_auto=True,height=300,
       title='Train/Test Split Distribution')

fig.show("png")

7 | LINEAR BASELINE MODEL
¶

In order to create a baseline model, let's extract the hidden state data from DistilBERT and use it as features for our linear model

GENERATING DATASET¶

We have a daframe containing text & label data for both training & validation datasets
We'll use HF's more intuitive to use Dataset class (which allows us to convert between types very easily)

In [37]:

Copied!

train_files
train_files

Out[37]:

	text	label
8290	I paid off all of my bills and should not have...	debt collection
1520	Several checks were issued from XXXX for possi...	other financial service
23071	In XXXX, we my husband and myself took out a l...	student loan
4123	I use an Amex Serve card ( a prepaid debit car...	prepaid card
16470	Despite YEARS of stellar credit reports and sc...	credit reporting
...	...	...
21733	I know that I am victim of student loan scam. ...	student loan
27578	On XX/XX/XXXX I made a payment of {$380.00} to...	bank account or service
26297	XXXX XXXX XXXX XXXX XXXX, AZ XXXX : ( XXXX ) X...	bank account or service
17867	After nearly a decade of business with Bank of...	credit card
144	On XXXX XXXX, 2015, I made a purchase on EBay....	money transfers

26100 rows × 2 columns

In [38]:

Copied!





import transformers
transformers.logging.set_verbosity_error()
import warnings; warnings.filterwarnings('ignore')
import os; os.environ['WANDB_DISABLED'] = 'true'
from datasets import Dataset,Features,Value,ClassLabel, DatasetDict 

traindts = Dataset.from_pandas(train_files)
traindts = traindts.class_encode_column("label")
testdts = Dataset.from_pandas(test_files)
testdts = testdts.class_encode_column("label")
import transformers
transformers.logging.set_verbosity_error()
import warnings; warnings.filterwarnings('ignore')
import os; os.environ['WANDB_DISABLED'] = 'true'
from datasets import Dataset,Features,Value,ClassLabel, DatasetDict 

traindts = Dataset.from_pandas(train_files)
traindts = traindts.class_encode_column("label")
testdts = Dataset.from_pandas(test_files)
testdts = testdts.class_encode_column("label")

In [39]:

Copied!





# Pandas indicies not reset ie. __index_level_0__ additional column is added, resetting index
corpus = DatasetDict({"train" : traindts , 
                      "validation" : testdts })
corpus['train']
# Pandas indicies not reset ie. __index_level_0__ additional column is added, resetting index
corpus = DatasetDict({"train" : traindts , 
                      "validation" : testdts })
corpus['train']

Out[39]:

Dataset({
    features: ['text', 'label', '__index_level_0__'],
    num_rows: 26100
})

TOKENISATION¶

Time to tokenise the text data
We'll be using the AutoTokenizer class from a pretrained model distilbert-base-uncased in order to generate subword tokens

In [40]:

Copied!





from transformers import AutoTokenizer

# Load parameters of the tokeniser
model_ckpt = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

# Tokenisation function
def tokenise(batch):
    return tokenizer(batch["text"], 
                     padding=True, 
                     truncation=True)

# apply to the entire dataset (train,test and validation dataset)
corpus_tokenised = corpus.map(tokenise, 
                              batched=True, 
                              batch_size=None)

print(corpus_tokenised["train"].column_names)
from transformers import AutoTokenizer

# Load parameters of the tokeniser
model_ckpt = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

# Tokenisation function
def tokenise(batch):
    return tokenizer(batch["text"], 
                     padding=True, 
                     truncation=True)

# apply to the entire dataset (train,test and validation dataset)
corpus_tokenised = corpus.map(tokenise, 
                              batched=True, 
                              batch_size=None)

print(corpus_tokenised["train"].column_names)

['text', 'label', '__index_level_0__', 'input_ids', 'attention_mask']

LOADING PRESET MODEL¶

HuggingFace allows us to load a variety of pretrained models:

Let's utlilise the distilbert-base-uncased model

The model was trained to predict [mask] values

Given an input sequence (the above link shows an example)

Let's use it to extract the last hidden state of each input sequence

Use it to train a more tradition machine learning model (baseline M1 model)

In [41]:

Copied!





from transformers import AutoModel
import torch

# load a pretrained transformer model
model_ckpt = "distilbert-base-uncased"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

# move model to device
model = AutoModel.from_pretrained(model_ckpt).to(device)
from transformers import AutoModel
import torch

# load a pretrained transformer model
model_ckpt = "distilbert-base-uncased"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

# move model to device
model = AutoModel.from_pretrained(model_ckpt).to(device)

cuda
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

EXTRACT MODEL HIDDEN STATE¶

For each tokenised input text, we can utilise the loaded model & extract the last hidden state that can be used as features for machine learning models
The same strategy was applied in notebook Twitter Emotion Classification

In [42]:

Copied!





# Function used to store last_hidden_state data of distilbert-base-uncased
def extract_hidden_states(batch):
    
    # Place model inputs on the GPU
    inputs = {k:v.to(device) for k,v in batch.items()
              if k in tokenizer.model_input_names}
    
    # Extract last hidden states
    with torch.no_grad():
        last_hidden_state = model(**inputs).last_hidden_state
        
    # Return vector for [CLS] token
    return {"hidden_state": last_hidden_state[:,0].cpu().numpy()}
# Function used to store last_hidden_state data of distilbert-base-uncased
def extract_hidden_states(batch):
    
    # Place model inputs on the GPU
    inputs = {k:v.to(device) for k,v in batch.items()
              if k in tokenizer.model_input_names}
    
    # Extract last hidden states
    with torch.no_grad():
        last_hidden_state = model(**inputs).last_hidden_state
        
    # Return vector for [CLS] token
    return {"hidden_state": last_hidden_state[:,0].cpu().numpy()}

TO TENSORS¶

Before training using the pytorch model, we need to set the corresponding format using set_format

In [43]:

Copied!





# Change Data to Torch tensor
corpus_tokenised.set_format("torch",
                            columns=["input_ids", "attention_mask", "label"])
corpus_tokenised
# Change Data to Torch tensor
corpus_tokenised.set_format("torch",
                            columns=["input_ids", "attention_mask", "label"])
corpus_tokenised

Out[43]:

DatasetDict({
    train: Dataset({
        features: ['text', 'label', '__index_level_0__', 'input_ids', 'attention_mask'],
        num_rows: 26100
    })
    validation: Dataset({
        features: ['text', 'label', '__index_level_0__', 'input_ids', 'attention_mask'],
        num_rows: 2901
    })
})

INFERENCE & EXTRACT HIDDEN STATE¶

Uisng map we can apply the function extract_hidden_states to both training & validation datasets

In [44]:

Copied!





# Extract last hidden states (faster w/ GPU)
corpus_hidden = corpus_tokenised.map(extract_hidden_states, 
                                     batched=True,
                                     batch_size=32)
corpus_hidden["train"].column_names
# Extract last hidden states (faster w/ GPU)
corpus_hidden = corpus_tokenised.map(extract_hidden_states, 
                                     batched=True,
                                     batch_size=32)
corpus_hidden["train"].column_names

Out[44]:

['text',
 'label',
 '__index_level_0__',
 'input_ids',
 'attention_mask',
 'hidden_state']

In [45]:

Copied!

# Empty cache
torch.cuda.empty_cache()
# Empty cache
torch.cuda.empty_cache()

The extracted hidden state corpus corpus_hidden has been uploaded to dataset

In [46]:

Copied!





# Save our data
corpus_hidden.set_format(type="pandas")

# Add label data to dataframe
def label_int2str(row):
    return corpus["train"].features["label"].int2str(row)

ldf = corpus_hidden["train"][:]
ldf["label_name"] = ldf["label"].apply(label_int2str)
ldf.to_pickle('training.df')

ldf = corpus_hidden["validation"][:]
ldf["label_name"] = ldf["label"].apply(label_int2str)
ldf.to_pickle('validation.df')
# Save our data
corpus_hidden.set_format(type="pandas")

# Add label data to dataframe
def label_int2str(row):
    return corpus["train"].features["label"].int2str(row)

ldf = corpus_hidden["train"][:]
ldf["label_name"] = ldf["label"].apply(label_int2str)
ldf.to_pickle('training.df')

ldf = corpus_hidden["validation"][:]
ldf["label_name"] = ldf["label"].apply(label_int2str)
ldf.to_pickle('validation.df')

In [47]:

Copied!

!ls /kaggle/working/
# !ls /kaggle/input/hiddenstatedata/
!ls /kaggle/working/
# !ls /kaggle/input/hiddenstatedata/

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
__notebook__.ipynb  training.df  validation.df

DEFINING HIDDEN STATE DATASET¶

Having extracted the last hidden state data from model distilbert-base-uncased, let's define the training data
The process is quite long so we'll load the data saved in the last section from training.csv & validation.csv

In [48]:

Copied!





# reload saved data
import pandas as pd
import pickle

# Load hidden state data
# training = pd.read_pickle('/kaggle/input/hiddenstatedata/training.df')
# validation = pd.read_pickle('/kaggle/input/hiddenstatedata/validation.df')
training = pd.read_pickle('training.df')
validation = pd.read_pickle('validation.df')
training.head()

labels = training[['label','label_name']]
# reload saved data
import pandas as pd
import pickle

# Load hidden state data
# training = pd.read_pickle('/kaggle/input/hiddenstatedata/training.df')
# validation = pd.read_pickle('/kaggle/input/hiddenstatedata/validation.df')
training = pd.read_pickle('training.df')
validation = pd.read_pickle('validation.df')
training.head()

labels = training[['label','label_name']]

In [49]:

Copied!

label = []
for i in labels.label.unique():
    label.append(labels[labels['label'] == i].iloc[[0]]['label_name'].values[0])
    
label
label = []
for i in labels.label.unique():
    label.append(labels[labels['label'] == i].iloc[[0]]['label_name'].values[0])
    
label

Out[49]:

['debt collection',
 'other financial service',
 'student loan',
 'prepaid card',
 'credit reporting',
 'mortgage',
 'payday loan',
 'credit card',
 'bank account or service',
 'money transfers',
 'virtual currency']

In [50]:

Copied!





# # Define our training & validation datasets

import numpy as np
X_train = np.stack(training['hidden_state'])
X_valid = np.stack(validation["hidden_state"])
y_train = np.array(training["label"])
y_valid = np.array(validation["label"])
print(f'Training Dataset: {X_train.shape}')
print(f'Validation Dataset {X_valid.shape}')
# # Define our training & validation datasets

import numpy as np
X_train = np.stack(training['hidden_state'])
X_valid = np.stack(validation["hidden_state"])
y_train = np.array(training["label"])
y_valid = np.array(validation["label"])
print(f'Training Dataset: {X_train.shape}')
print(f'Validation Dataset {X_valid.shape}')

Training Dataset: (26100, 768)
Validation Dataset (2901, 768)

TRAIN MODEL¶

Let's start with something quite simplistic, LogisticRegression often work quite well

In [51]:

Copied!

%%time

from sklearn.linear_model import LogisticRegression as LR

# We increase `max_iter` to guarantee convergence
lr_clf = LR(max_iter = 2000)
lr_clf.fit(X_train, y_train)
%%time

from sklearn.linear_model import LogisticRegression as LR

# We increase `max_iter` to guarantee convergence
lr_clf = LR(max_iter = 2000)
lr_clf.fit(X_train, y_train)

CPU times: user 8min 59s, sys: 39.2 s, total: 9min 38s
Wall time: 5min 5s

Out[51]:

LogisticRegression(max_iter=2000)

In [52]:

Copied!





# Predictions
y_preds_train = lr_clf.predict(X_train)
y_preds_valid = lr_clf.predict(X_valid)
print('LogisticRegression:')
print(f'training accuracy: {round(lr_clf.score(X_train, y_train),3)}')
print(f'validation accuracy: {round(lr_clf.score(X_valid, y_valid),3)}')
# Predictions
y_preds_train = lr_clf.predict(X_train)
y_preds_valid = lr_clf.predict(X_valid)
print('LogisticRegression:')
print(f'training accuracy: {round(lr_clf.score(X_train, y_train),3)}')
print(f'validation accuracy: {round(lr_clf.score(X_valid, y_valid),3)}')

LogisticRegression:
training accuracy: 0.807
validation accuracy: 0.772

In [53]:

Copied!





# save sklearn model
import joblib

filename = 'classifier.joblib.pkl'
_ = joblib.dump(lr_clf, filename, compress=9)

# load sklearn model
# lr_clf = joblib.load('/kaggle/input/hiddenstatedata/' + filename)
# lr_clf
# save sklearn model
import joblib

filename = 'classifier.joblib.pkl'
_ = joblib.dump(lr_clf, filename, compress=9)

# load sklearn model
# lr_clf = joblib.load('/kaggle/input/hiddenstatedata/' + filename)
# lr_clf

In [54]:

Copied!





import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

def plot_confusion_matrix(y_model, y_true, labels):
    cm = confusion_matrix(y_true,y_model,normalize='true')
    fig, ax = plt.subplots(figsize=(8,8))
    disp = ConfusionMatrixDisplay(confusion_matrix=cm.round(2).copy(), display_labels=labels)
    disp.plot(ax=ax, colorbar=False)
    plt.title("Confusion matrix")
    plt.xticks(rotation = 90) # Rotates X-Axis Ticks by 45-degrees
    plt.tight_layout()
    plt.show()
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

def plot_confusion_matrix(y_model, y_true, labels):
    cm = confusion_matrix(y_true,y_model,normalize='true')
    fig, ax = plt.subplots(figsize=(8,8))
    disp = ConfusionMatrixDisplay(confusion_matrix=cm.round(2).copy(), display_labels=labels)
    disp.plot(ax=ax, colorbar=False)
    plt.title("Confusion matrix")
    plt.xticks(rotation = 90) # Rotates X-Axis Ticks by 45-degrees
    plt.tight_layout()
    plt.show()

In [55]:

Copied!

labels = list(training.label_name.value_counts().index)

# Validation Dataset Confusion Matrix
plot_confusion_matrix(y_preds_valid, y_valid, labels)
labels = list(training.label_name.value_counts().index)

# Validation Dataset Confusion Matrix
plot_confusion_matrix(y_preds_valid, y_valid, labels)

Compressed distilbert embedding features work quite well in this this problem
Looks like we have quite a good model to begin with, scoring a validation accuracy of 0.77
We can note that the model payday loan & virtual currency are very pooly predicted subsets
other financial services is predicted quite well (which is surprising because for the transformer model, we have the opposite)

8 | FINE-TUNE DISTILBERT
¶

With the fine-tune approach, we do not use the hidden states as fixed features
Instead, we train them from a given model state
This requires the classification head to be differentiable (neural network for classification)

LOAD PRETRAINED MODEL¶

We'll load the same DistilBERT model using model_ckpt "distilbert-base-uncased"
This time however we will be loading AutoModelForSequenceClassification (we used AutoModel when we extracted embedding features)
AutoModelForSequenceClassification model has a classification head on top of the pretrained model outputs
We only need to specify the number of labels the model has to predict num_labels

In [56]:

Copied!

# Empty cache
torch.cuda.empty_cache()
# Empty cache
torch.cuda.empty_cache()

In [57]:

Copied!





# Change Data to Torch tensor
corpus_tokenised.set_format("torch",
                            columns=["input_ids", "attention_mask", "label"])
corpus_tokenised
# Change Data to Torch tensor
corpus_tokenised.set_format("torch",
                            columns=["input_ids", "attention_mask", "label"])
corpus_tokenised

Out[57]:

DatasetDict({
    train: Dataset({
        features: ['text', 'label', '__index_level_0__', 'input_ids', 'attention_mask'],
        num_rows: 26100
    })
    validation: Dataset({
        features: ['text', 'label', '__index_level_0__', 'input_ids', 'attention_mask'],
        num_rows: 2901
    })
})

In [58]:

Copied!





from transformers import AutoModelForSequenceClassification
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_ckpt = "distilbert-base-uncased"

model = (AutoModelForSequenceClassification
         .from_pretrained(model_ckpt, 
                          num_labels=len(labels))
         .to(device))
from transformers import AutoModelForSequenceClassification
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_ckpt = "distilbert-base-uncased"

model = (AutoModelForSequenceClassification
         .from_pretrained(model_ckpt, 
                          num_labels=len(labels))
         .to(device))

EVALUATION METRICS¶

We'll monitor the F1 score & accuracy, the function is required to be passed in the Trainer class

In [59]:

Copied!





from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    f1 = f1_score(labels, preds, average="weighted")
    acc = accuracy_score(labels, preds)
    return {"accuracy": acc, "f1": f1}
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    f1 = f1_score(labels, preds, average="weighted")
    acc = accuracy_score(labels, preds)
    return {"accuracy": acc, "f1": f1}

DEFINE TRAINER¶

Next we need to define the model training parameters, which can be done using TrainingArguments
Let's train the DistilBERT model for 3 iterations with a learning rate of 2e-5 and a batch size of 64
The Trainer requires inputs of a model, model arguments, metrics, the datasets (train,validation) & the tokeniser

In [60]:

Copied!





from transformers import Trainer, TrainingArguments
from transformers import Trainer

bs = 16 # batch size
model_name = f"{model_ckpt}-finetuned-financial"
labels = corpus_tokenised["train"].features["label"].names

# Training Arguments
training_args = TrainingArguments(output_dir=model_name,
                                  num_train_epochs=3,             # number of training epochs
                                  learning_rate=2e-5,             # model learning rate
                                  per_device_train_batch_size=bs, # batch size
                                  per_device_eval_batch_size=bs,  # batch size
                                  weight_decay=0.01,
                                  evaluation_strategy="epoch",
                                  disable_tqdm=False, 
                                  report_to="none",
                                  push_to_hub=False,
                                  log_level="error")


trainer = Trainer(model=model,                                 # Model
                  args=training_args,                          # Training arguments (above)
                  compute_metrics=compute_metrics,             # Computational Metrics
                  train_dataset=corpus_tokenised["train"],     # Training Dataset   
                  eval_dataset=corpus_tokenised["validation"], # Evaluation Dataset
                  tokenizer=tokenizer)
from transformers import Trainer, TrainingArguments
from transformers import Trainer

bs = 16 # batch size
model_name = f"{model_ckpt}-finetuned-financial"
labels = corpus_tokenised["train"].features["label"].names

# Training Arguments
training_args = TrainingArguments(output_dir=model_name,
                                  num_train_epochs=3,             # number of training epochs
                                  learning_rate=2e-5,             # model learning rate
                                  per_device_train_batch_size=bs, # batch size
                                  per_device_eval_batch_size=bs,  # batch size
                                  weight_decay=0.01,
                                  evaluation_strategy="epoch",
                                  disable_tqdm=False, 
                                  report_to="none",
                                  push_to_hub=False,
                                  log_level="error")


trainer = Trainer(model=model,                                 # Model
                  args=training_args,                          # Training arguments (above)
                  compute_metrics=compute_metrics,             # Computational Metrics
                  train_dataset=corpus_tokenised["train"],     # Training Dataset   
                  eval_dataset=corpus_tokenised["validation"], # Evaluation Dataset
                  tokenizer=tokenizer)

TRAIN MODEL¶

Let's finally fine-tune our transform model to fit our classification problem

In [62]:

Copied!

%%time

# Train & save model
trainer.train()
trainer.save_model()
%%time

# Train & save model
trainer.train()
trainer.save_model()

[4896/4896 37:20, Epoch 3/3]

Epoch	Training Loss	Validation Loss	Accuracy	F1
1	0.535000	0.528666	0.842468	0.839433
2	0.410300	0.483131	0.858669	0.855979
3	0.312500	0.477747	0.866942	0.864213

CPU times: user 36min 55s, sys: 16.5 s, total: 37min 12s
Wall time: 37min 21s

In [63]:

Copied!

# from transformers import pipeline
# load from previously saved model
# classifier = pipeline("text-classification", model="distilbert-base-uncased-finetuned-financial")
# from transformers import pipeline
# load from previously saved model
# classifier = pipeline("text-classification", model="distilbert-base-uncased-finetuned-financial")

INFERENCE¶

Let's utilise our fine-tuned transformer model for inference on the validation dataset

In [64]:

Copied!

# Predict on Validation Dataset
pred_output = trainer.predict(corpus_tokenised["validation"])
pred_output
# Predict on Validation Dataset
pred_output = trainer.predict(corpus_tokenised["validation"])
pred_output

[182/182 00:26]

Out[64]:

PredictionOutput(predictions=array([[ 0.4898075 , -1.1539025 , -2.4000468 , ...,  2.664791  ,
        -1.7696137 , -1.85924   ],
       [ 5.524978  ,  1.4433606 , -1.2144918 , ..., -2.0602236 ,
        -2.2785227 , -3.1501598 ],
       [-0.99464035,  1.410322  ,  5.1908035 , ..., -2.1883903 ,
        -1.5084958 , -3.9436107 ],
       ...,
       [ 0.8029568 ,  2.751484  , -1.0246688 , ..., -2.6813054 ,
         0.32632264, -3.8141603 ],
       [ 0.27522528, -1.2025931 , -0.14139701, ..., -2.684458  ,
        -1.0188793 , -3.16859   ],
       [-0.17902231, -1.83018   , -0.8901656 , ..., -3.2526188 ,
         0.6185259 , -3.5121055 ]], dtype=float32), label_ids=array([4, 1, 2, ..., 0, 5, 5]), metrics={'test_loss': 0.47774738073349, 'test_accuracy': 0.8669424336435712, 'test_f1': 0.8642125632561599, 'test_runtime': 26.9558, 'test_samples_per_second': 107.621, 'test_steps_per_second': 6.752})

In [65]:

Copied!

print(f'Output Predition: {pred_output.predictions.shape}')
print(pred_output.predictions)
print(f'Output Predition: {pred_output.predictions.shape}')
print(pred_output.predictions)

Output Predition: (2901, 11)
[[ 0.4898075  -1.1539025  -2.4000468  ...  2.664791   -1.7696137
  -1.85924   ]
 [ 5.524978    1.4433606  -1.2144918  ... -2.0602236  -2.2785227
  -3.1501598 ]
 [-0.99464035  1.410322    5.1908035  ... -2.1883903  -1.5084958
  -3.9436107 ]
 ...
 [ 0.8029568   2.751484   -1.0246688  ... -2.6813054   0.32632264
  -3.8141603 ]
 [ 0.27522528 -1.2025931  -0.14139701 ... -2.684458   -1.0188793
  -3.16859   ]
 [-0.17902231 -1.83018    -0.8901656  ... -3.2526188   0.6185259
  -3.5121055 ]]

In [66]:

Copied!





import numpy as np

# Decode the predictions greedily using argmax (highest value of all classes)
y_preds = np.argmax(pred_output.predictions,axis=1)
print(f'Output Prediction:{y_preds.shape}')
print(f'Predictions: {y_preds}')
import numpy as np

# Decode the predictions greedily using argmax (highest value of all classes)
y_preds = np.argmax(pred_output.predictions,axis=1)
print(f'Output Prediction:{y_preds.shape}')
print(f'Predictions: {y_preds}')

Output Prediction:(2901,)
Predictions: [4 0 2 ... 1 5 5]

In [67]:

Copied!

# Validation 
plot_confusion_matrix(y_preds,y_valid,labels)
# Validation 
plot_confusion_matrix(y_preds,y_valid,labels)

Fine-tuning a pretrained-transformed works sufficiently better than our baseline approach
virtual currency is also not predicted very well (same as the linear model)
payday loan is predicted well but other financial services poorly

Financial consumer complaint analysis

1 | BACKGROUND¶

CONSUMER FEEDBACK¶

2 | NOTEBOOK WORKFLOW¶

STUDY AIM¶

AUTOMATED TICKET CLASSIFICATION MODEL¶

3 | DATA PREPROCESSING¶

CONSUMER COMPLAINT DATABASE¶

LOAD DATASET¶

TARGET LABELS¶

MISSING DATA¶

FEATURE DESCRIPTION¶

ADDING DATETIME FEATURES¶

COLUMN NAME NORMALISATION¶

NORMALISATION OF SUBSET NAMES¶

FILTER SUBSET DATA¶

SELECT SUBSET OF DATA¶

4 | TARGET DISTRIBUTION¶

PRODUCT¶

5 | EXPLORATORY DATA ANALYSIS¶

CONSUMER COMPLAINT METHOD¶

FINANCIAL COMPANY¶

COMPANY RESPONSE¶

COMPLAINT TIMELINE¶

CONSUMER COMPLAINT TIMELINE TRENDS¶

COMPLAINT ORIGIN (STATE) ¶

COMPARING RESPONSE TO DISPUTES¶

6 | PREPARING DATA FOR MODELING¶

REVIEW CLASS DISBALANCE¶

PRODUCT SUBSET AMBIGUITY¶

LIMIT TARGET CLASS SUBSET SAMPLES¶

TRAIN-TEST SUBSET SPLITTING¶

7 | LINEAR BASELINE MODEL¶

GENERATING DATASET¶

TOKENISATION¶

LOADING PRESET MODEL¶

EXTRACT MODEL HIDDEN STATE¶

TO TENSORS¶

INFERENCE & EXTRACT HIDDEN STATE¶

DEFINING HIDDEN STATE DATASET¶

TRAIN MODEL¶

8 | FINE-TUNE DISTILBERT¶

LOAD PRETRAINED MODEL¶

EVALUATION METRICS¶

DEFINE TRAINER¶

TRAIN MODEL¶

INFERENCE¶

1 | BACKGROUND
¶

CONSUMER FEEDBACK
¶

2 | NOTEBOOK WORKFLOW
¶

3 | DATA PREPROCESSING
¶

4 | TARGET DISTRIBUTION
¶

5 | EXPLORATORY DATA ANALYSIS
¶

6 | PREPARING DATA FOR MODELING
¶

7 | LINEAR BASELINE MODEL
¶

8 | FINE-TUNE DISTILBERT
¶