Named Entity Recognition with Torch Loop¶
In this notebook, we'll take a look at how we can utilise HuggingFace to easily load and use BERT for token classification. Whilst we are loading both the base model & tokeniser from HuggingFace, we'll be using a custom Torch training loop and a customised model tail. This approach isn't the most straightforward, but it is one way we can do it. We'll be utilising the MASSIVE dataset by Amazon and fine-tuning the transformer encoder BERT.
Background¶
What is NER?¶
- NER is a natural language processing technique which identifies and extracts named entities from unstructured text
- Named entities refer to words or combinations of words that represent specific objects, places, etc; in principle they can be anything we define them to be
- NER algorithms use machine or deep learning to analyse text and recognise patterns that indicate the presence of a named entity
Applications¶
Named Entity Recognition has a wide range of applications in the field of Natural Language Processing and Information Retrieval.
A few such examples are listed below1:
Classifying content for news providers:
A large amount of online content is generated by news and publishing houses on a daily basis, and managing it correctly can be a challenging task for human workers. Named Entity Recognition can automatically scan entire articles and help identify and retrieve the major people, organizations, and places discussed in them. Articles are thus automatically categorized into defined hierarchies, and the content is also much more easily discovered.
Automatically Summarizing Resumes:
You might have come across various tools that scan your resume and retrieve important information such as name, address, and qualifications. The majority of such tools use NER software to retrieve this information. One of the challenging tasks faced by HR departments across companies is evaluating a gigantic pile of resumes to shortlist candidates. Many of these resumes are excessively detailed, and most of the information is irrelevant to the evaluator. Using an NER model, the information relevant to the evaluator can be easily retrieved, simplifying the effort required to shortlist candidates from a pile of resumes.
Optimizing Search Engine Algorithms:
When designing a search engine algorithm, it would be inefficient and computationally expensive to search for an entire query across the millions of articles and websites online. An alternative is to run an NER model on the articles once and store the entities associated with them permanently. For a quick and efficient search, the key tags in the search query can then be compared with the tags associated with the website articles.
Powering Recommendation systems:
NER can be used in developing algorithms for recommender systems that make suggestions based on our search history or on our present activity. This is achieved by extracting the entities associated with the content in our history or previous activity and comparing them with the labels assigned to other, unseen content. Thus we frequently see content of interest to us.
Simplifying Customer Support:
Usually, a company gets tons of customer complaints and feedback on a daily basis, and going through each one of them and recognizing the concerned parties is not an easy task. Using NER we can recognize relevant entities in customer complaints and feedback such as Product specifications, department, or company branch location so that the feedback is classified accordingly and forwarded to the appropriate department responsible for the identified product.
The Dataset¶
To realise NER with HuggingFace, we'll be utilising the multilingual dataset MASSIVE.
Massive 1.1¶
Let's load our dataset MASSIVE; the dataset can be found on the HuggingFace Hub.
MASSIVE 1.1 is a parallel dataset of > 1M utterances across 52 languages with annotations for the Natural Language Understanding tasks of intent prediction and slot annotation. Utterances span 60 intents and include 55 slot types. MASSIVE was created by localizing the SLURP dataset, composed of general Intelligent Voice Assistant single-shot interactions.
We will be utilising this dataset for Named Entity Recognition. Let's load our dataset, selecting only the subset ru-RU of the column locale.
There are quite a number of ways we can extract data from this dataset; let's use the fastest way:
- Specify which locale group we want to load (so we don't load everything)
- Define the subset of rows we want to use with split (eg. train[10:100])
from datasets import load_dataset, Dataset

train_dataset = load_dataset('AmazonScience/massive', 'ru-RU', split='train[:100]')
test_dataset = load_dataset('AmazonScience/massive', 'ru-RU', split='test[:100]')
Relevant Columns¶
Some samples from our documents are located in the column utt, and the NER annotations are located in annot_utt, which uses the format [tag : tokens]:
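The cell that printed the samples isn't included here; a minimal sketch that would produce the two lists below could look like this:

# show the first ten utterances and their annotated counterparts
# (assumes train_dataset was loaded as above)
print(train_dataset["utt"][:10])
print(train_dataset["annot_utt"][:10])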
['разбуди меня в девять утра в пятницу',
'поставь будильник на два часа вперед',
'олли тихо',
'отстановись',
'олли остановись на десять секунд',
'остановись на десять секунд',
'сделай освещение здесь чуть более тёплым',
'пожалуйста сделай свет подходящий для чтения',
'время идти спать',
'олли время спать']
['разбуди меня в [time : девять утра] в [date : пятницу]',
'поставь будильник [time : на два часа вперед]',
'олли тихо',
'отстановись',
'олли остановись на [time : десять секунд]',
'остановись на [time : десять секунд]',
'сделай освещение здесь чуть более [color_type : тёплым]',
'пожалуйста сделай свет [color_type : подходящий для чтения]',
'время идти спать',
'олли время спать']
Text Preprocessing¶
Define Tokeniser & Model¶
For preprocessing we'll need a tokeniser, so let's define the tokeniser & the base model
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("ai-forever/ruBert-base")
MODEL = AutoModel.from_pretrained("ai-forever/ruBert-base")
# tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased") # multilingual
# model = AutoModel.from_pretrained("bert-base-multilingual-cased") # multilingual
# tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased") # en only
# model = AutoModel.from_pretrained("bert-base-uncased") # en only
NER format¶
NER tagging formats can vary quite a bit, so let's utilise a class Parser to preprocess the NER format in our dataset.
The Parser class contains the relevant preprocessing steps we will take:
- The class will be called when creating our Dataset, creating NER tags for each document
- Its inputs are the document & the annotated document
- The document itself is used in order to create the O tags
- Whilst the annotated document is used to extract the annotated tags
- Tag dictionaries tag_to_id & id_to_tag are created
from typing import List

import regex as re

'''
PARSER FOR THE DATASET NER TAG FORMAT
'''

class Parser:

    # RE patterns for tag extraction
    LABEL_PATTERN = r"\[(.*?)\]"
    PUNCTUATION_PATTERN = r"([.,\/#!$%\^&\*;:{}=\-_`~()'\"’¿])"

    # initialise, first word/id tag is O (outside)
    def __init__(self):
        self.tag_to_id = {
            "O": 0
        }
        self.id_to_tag = {
            0: "O"
        }

    '''
    CREATE TAGS
    '''

    # input : sentence, tagged sentence
    def __call__(self, sentence: str, annotated: str) -> List[str]:

        ''' Create Dictionary of Identified Tags'''
        # 1. set label B or I
        matches = re.findall(self.LABEL_PATTERN, annotated)
        word_to_tag = {}
        for match in matches:
            tag, phrase = match.split(" : ")
            words = phrase.split(" ")
            word_to_tag[words[0]] = f"B-{tag.upper()}"
            for w in words[1:]:
                word_to_tag[w] = f"I-{tag.upper()}"

        ''' Tokenise Sentence & add tags to not tagged words (O)'''
        # 2. add token tag to main tag dictionary
        tags = []
        sentence = re.sub(self.PUNCTUATION_PATTERN, r" \1 ", sentence)
        for w in sentence.split():
            if w not in word_to_tag:
                tags.append("O")
            else:
                tags.append(word_to_tag[w])
                self.__add_tag(word_to_tag[w])

        return tags

    '''
    TAG CONVERSION
    '''

    # to word2id (tag_to_id)
    # to id2word (id_to_tag)
    def __add_tag(self, tag: str):
        if tag in self.tag_to_id:
            return
        id_ = len(self.tag_to_id)
        self.tag_to_id[tag] = id_
        self.id_to_tag[id_] = tag

    ''' Get Tag Number ID '''
    # or just number id for token
    def get_id(self, tag: str):
        return self.tag_to_id[tag]

    ''' Get Tag Token from Number ID'''
    # given id get its token
    def get_label(self, id_: int):
        return self.id_to_tag[id_]

parser = Parser()
parser(train_dataset["utt"][0], train_dataset["annot_utt"][0])
Create Dataset¶
The NERDataset class is our Dataset class; it requires the Parser class instance.
The inputs into the parser object are utt & its annotated version annot_utt.
The output from the parser will be a list of target BIO tags for each document:
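The cell that produced the lists below isn't shown; a minimal sketch would simply call the parser on the first few training samples:

# BIO tags for the first few training utterances (sketch)
for utt, annot in zip(train_dataset["utt"][:10], train_dataset["annot_utt"][:10]):
    print(parser(utt, annot))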
['O', 'O', 'O', 'B-TIME', 'I-TIME', 'O', 'B-DATE']
['O', 'O', 'B-TIME', 'I-TIME', 'I-TIME', 'I-TIME']
['O', 'O']
['O']
['O', 'O', 'O', 'B-TIME', 'I-TIME']
['O', 'O', 'B-TIME', 'I-TIME']
['O', 'O', 'O', 'O', 'O', 'B-COLOR_TYPE']
['O', 'O', 'O', 'B-COLOR_TYPE', 'I-COLOR_TYPE', 'I-COLOR_TYPE']
['O', 'O', 'O']
Each item (document) output by NERDataset looks like the following:
tmp[idx] = {
    **tokenizer_output,                            # BERT related token inputs
    "subword_group": torch.tensor(subword_group),  # token/word association (start, number of tokens in word)
    "target": torch.tensor(target)                 # word NER tag
}
Now the class itself:
import itertools

import torch
from torch.utils.data import Dataset, DataLoader
from tqdm import tqdm

class NERDataset(Dataset):

    def __init__(self, dataset, tokenizer):
        self.tokenizer = tokenizer
        self.processed_data = self.__preprocess(dataset)

    def __len__(self):
        return len(self.processed_data)

    def __getitem__(self, idx):
        return self.processed_data[idx]

    def __preprocess(self, dataset):
        tmp = {}
        for idx in tqdm(range(len(dataset))):
            item = dataset[idx]
            tags = parser(item["utt"], item["annot_utt"])   # get list of tags
            tokenizer_output = self.tokenizer(item["utt"],  # tokenise document (incl. <bos>,<eos>)
                                              padding=True,
                                              truncation=True,
                                              return_tensors='pt')
            # token word identifier (each word can have multiple tokens)
            word_ids = tokenizer_output.word_ids()
            # for each word, how many subtokens are there (starts with 1 - first word)
            subword_group = [
                (key + 1, len(list(group)))
                for key, group in itertools.groupby(word_ids)
                if key is not None
            ]  # index to aggregate tokens
            # define bio tags for each word in numerical format using parser
            target = [parser.get_id(t) for t in tags]
            # group all relevant data that will be used in forward pass
            tmp[idx] = {
                **tokenizer_output,
                "subword_group": torch.tensor(subword_group),
                "target": torch.tensor(target)
            }
            # check consistency
            try:
                assert (len(subword_group) == len(target))
            except AssertionError:
                print(item["annot_utt"], subword_group, target)
        return tmp

train = NERDataset(train_dataset, tokenizer)
test = NERDataset(test_dataset, tokenizer)
0%| | 0/1000 [00:00<?, ?it/s]Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
100%|██████████| 1000/1000 [00:00<00:00, 1690.12it/s]
100%|██████████| 1000/1000 [00:00<00:00, 1598.76it/s]
BERT Multiclass Classifier¶
We will use a rather custom approach to NER: we have already loaded our encoder model, so we'll need to create a tail end for it so that we can use it for NER multiclass classification.
Model Architecture¶
Let's create a custom training & evaluation loop for our NER multiclass classification problem:
- In total we have 34 classification tags (including O), so we can define the tail output dimension
- Training the BERT model using input_ids & attention_mask (tokens), we extract the hidden state for all tokens in the document
- Using the created subword_group data, we extract the token embeddings that correspond to each word in the document (see the sketch after this list)
- For words which have more than one token, we average the token embeddings to obtain word embeddings
- The word embeddings are passed to the linear tail classifier, which returns the logits for each word in the sentence
- To get the most probable tag for each word in the document, we use torch.argmax
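To make the token-to-word grouping concrete, here is a small sketch (with a hypothetical word_ids list) of what the subword_group construction used in NERDataset above, and in the forward pass below, produces:

import itertools

# hypothetical word_ids() output: None marks [CLS]/[SEP], integers index words
word_ids = [None, 0, 1, 1, 2, None]   # word 1 was split into two sub-tokens

subword_group = [
    (key + 1, len(list(group)))       # (word index offset by [CLS], number of sub-tokens)
    for key, group in itertools.groupby(word_ids)
    if key is not None
]
print(subword_group)                  # [(1, 1), (2, 2), (3, 1)]
# in the forward pass, each (b, e) pair selects last_hidden_state[:, b:b+e]
# and the selected token embeddings are averaged into one word embedding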
To utilise our custom model with methods such as .from_pretrained, let's subclass PreTrainedModel, which itself builds on nn.Module.
import torch.nn as nn
from transformers import PreTrainedModel
from transformers import PretrainedConfig
from transformers import AutoModel, AutoConfig

class MyConfig(PretrainedConfig):
    model_type = 'mymodel'
    def __init__(self, important_param=42, **kwargs):
        super().__init__(**kwargs)
        self.important_param = important_param

# PreTrainedModel has nn.Module
class NERClassifier(PreTrainedModel):

    config_class = MyConfig

    def __init__(self, config):
        super().__init__(config)
        self.bert = MODEL
        self.seq = nn.Sequential(
            nn.Linear(768, 256),
            nn.ReLU(),
            nn.Linear(256, CLASSES),
        )

    def forward(self, inputs):  # returns list of targets

        # standard inputs for BERT
        bert_output = self.bert(
            inputs["input_ids"],
            inputs["attention_mask"]
        )

        # output of transformer encoder will be our hidden state for
        # each of the input_ids
        last_hidden_state = bert_output["last_hidden_state"]

        # tokens correspond to tokenizer divisions
        # each word can be split into multiple tokens
        # ie. get the mean word embedding
        target = []
        for group in inputs["subword_group"]:
            b, e = group
            word_embedding = last_hidden_state[:, b:b+e]       # get the token embeddings
            agg_embedding = torch.mean(word_embedding, dim=1)  # mean word embedding over its tokens
            # mean word embedding (1,768) passes into the nn.Sequential linear tail end
            proba = self.__forward_one(agg_embedding)          # logits
            target.append(proba)

        word_logits = torch.stack(target).squeeze(1)
        return word_logits

    def __forward_one(self, x):
        logits = self.seq(x)
        return logits
Model Additions¶
Aside from defining the classification model architecture for our base model BERT:
- We need to create dataloaders and instantiate our model (ie tail weights need to be initialised)
- Define the model evaluation criterion (loss function)
- Define the relevant optimiser to update weights
import torch.optim as optim

# use GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# define data loaders
train_loader = DataLoader(train)
test_loader = DataLoader(test)

CLASSES = len(parser.tag_to_id); print(f'{CLASSES} labels')
config = MyConfig(4)

# define classifier model, loss function & optimiser
clf = NERClassifier(config).to(device)
criterion = nn.CrossEntropyLoss()  # for multiclass classification
optimizer = optim.Adam(clf.parameters(), lr=1e-5)
In total, we have created 63 tags in tag_to_id
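Before training, a quick sanity check (a sketch, not part of the original notebook) confirms that a single forward pass returns one row of logits per word:

# run one preprocessed item through the untrained classifier
sample = {key: val.to(device) for key, val in train[0].items()}
with torch.no_grad():
    logits = clf(sample)
print(logits.shape)  # (number of words in the utterance, CLASSES)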
Train Model¶
- Let's train our model for 20 epochs on train_loader & validate the model using test_loader
- To check how well the model is performing during training, we will use the accuracy metric
from sklearn.metrics import accuracy_score

for epoch in range(20):

    '''
    (1) TRAINING LOOP
    '''

    loss_count, loss_sum = 0, 0
    y_true, y_pred = [], []

    # switch to training mode, ie backpropagation on
    clf.train()
    for data in tqdm(train_loader):

        # move data to device
        inputs = {
            key: val.squeeze(0).to(device)
            for key, val in data.items()
        }

        # logits of belonging to each of the tag classes
        # for all words in the document
        outputs = clf(inputs)

        # predicted word tag
        word_tag = torch.argmax(outputs, dim=1).tolist()
        y_true.extend(inputs["target"].tolist())
        y_pred.extend(word_tag)

        # calculate loss
        loss = criterion(outputs, inputs["target"])
        loss_count += 1
        loss_sum += loss.item()

        # nn.utils.clip_grad_norm_(
        #     parameters=clf.parameters(), max_norm=20
        # )

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print(f"Epoch-{epoch + 1}: loss: {loss_sum / loss_count}; acc: {accuracy_score(y_true, y_pred)}")

    '''
    (2) VALIDATION LOOP
    '''

    test_loss_sum, test_loss_count = 0, 0
    test_true, test_pred = [], []

    # switch to inference mode (no gradients, dropout off)
    clf.eval()
    with torch.no_grad():
        for test_rows in tqdm(test_loader):

            # move data to device
            test_inputs = {
                key: val.squeeze(0).to(device)
                for key, val in test_rows.items()
            }
            test_outputs = clf(test_inputs)

            # add metric data
            test_true.extend(test_inputs["target"].tolist())
            test_pred.extend(torch.argmax(test_outputs, dim=1).tolist())

            test_loss = criterion(test_outputs, test_inputs["target"])
            test_loss_count += 1
            test_loss_sum += test_loss.item()

    print(f"Epoch-{epoch + 1}: loss: {loss_sum / loss_count}; acc: {accuracy_score(y_true, y_pred)},\
          val_loss: {test_loss_sum / test_loss_count}, val_acc: {accuracy_score(test_true, test_pred)}")
100%|██████████| 1000/1000 [01:02<00:00, 16.09it/s]
Epoch-1: loss: 1.6441781679093839; acc: 0.7374777271827361
100%|██████████| 1000/1000 [00:11<00:00, 88.19it/s]
Epoch-1: loss: 1.6441781679093839; acc: 0.7374777271827361, val_loss: 1.2796487891450525, val_acc: 0.7421347230264428
100%|██████████| 1000/1000 [01:00<00:00, 16.50it/s]
Epoch-2: loss: 1.1323998833782971; acc: 0.7539101168085528
100%|██████████| 1000/1000 [00:11<00:00, 87.29it/s]
Epoch-2: loss: 1.1323998833782971; acc: 0.7539101168085528, val_loss: 1.065644938390702, val_acc: 0.7639451843273499
100%|██████████| 1000/1000 [01:01<00:00, 16.32it/s]
Epoch-3: loss: 0.8694944252893329; acc: 0.7895466244308058
100%|██████████| 1000/1000 [00:11<00:00, 85.78it/s]
Epoch-3: loss: 0.8694944252893329; acc: 0.7895466244308058, val_loss: 0.9206145468372852, val_acc: 0.7815093611271955
100%|██████████| 1000/1000 [00:58<00:00, 16.95it/s]
Epoch-4: loss: 0.6839374071029015; acc: 0.8204315977034251
100%|██████████| 1000/1000 [00:11<00:00, 84.64it/s]
Epoch-4: loss: 0.6839374071029015; acc: 0.8204315977034251, val_loss: 0.8384006800730712, val_acc: 0.7915460335842501
100%|██████████| 1000/1000 [00:59<00:00, 16.81it/s]
Epoch-5: loss: 0.5615229318903293; acc: 0.8445852306473965
100%|██████████| 1000/1000 [00:11<00:00, 87.55it/s]
Epoch-5: loss: 0.5615229318903293; acc: 0.8445852306473965, val_loss: 0.7858886199912521, val_acc: 0.7979154603358425
100%|██████████| 1000/1000 [00:59<00:00, 16.78it/s]
Epoch-6: loss: 0.4753220743876882; acc: 0.8643832904375371
100%|██████████| 1000/1000 [00:11<00:00, 88.00it/s]
Epoch-6: loss: 0.4753220743876882; acc: 0.8643832904375371, val_loss: 0.8105039822029648, val_acc: 0.7971434086083767
100%|██████████| 1000/1000 [00:59<00:00, 16.81it/s]
Epoch-7: loss: 0.3950597488232888; acc: 0.8774500098990299
100%|██████████| 1000/1000 [00:11<00:00, 84.88it/s]
Epoch-7: loss: 0.3950597488232888; acc: 0.8774500098990299, val_loss: 0.7834225822311127, val_acc: 0.8062150164060992
100%|██████████| 1000/1000 [00:59<00:00, 16.88it/s]
Epoch-8: loss: 0.33086000567564045; acc: 0.8948723025143536
100%|██████████| 1000/1000 [00:11<00:00, 84.73it/s]
Epoch-8: loss: 0.33086000567564045; acc: 0.8948723025143536, val_loss: 0.767055139105185, val_acc: 0.8096892491796951
100%|██████████| 1000/1000 [01:00<00:00, 16.50it/s]
Epoch-9: loss: 0.28477367215882987; acc: 0.9103147891506632
100%|██████████| 1000/1000 [00:11<00:00, 85.45it/s]
Epoch-9: loss: 0.28477367215882987; acc: 0.9103147891506632, val_loss: 0.7778430652543611, val_acc: 0.8029337965643698
100%|██████████| 1000/1000 [00:59<00:00, 16.81it/s]
Epoch-10: loss: 0.23926631732314127; acc: 0.9227875668184518
100%|██████████| 1000/1000 [00:11<00:00, 88.43it/s]
Epoch-10: loss: 0.23926631732314127; acc: 0.9227875668184518, val_loss: 0.7841836358316068, val_acc: 0.8120054043620922
100%|██████████| 1000/1000 [00:59<00:00, 16.82it/s]
Epoch-11: loss: 0.20171416073056753; acc: 0.9362502474757474
100%|██████████| 1000/1000 [00:11<00:00, 84.28it/s]
Epoch-11: loss: 0.20171416073056753; acc: 0.9362502474757474, val_loss: 0.849082443419844, val_acc: 0.8104613009071607
100%|██████████| 1000/1000 [00:59<00:00, 16.83it/s]
Epoch-12: loss: 0.17783702248180636; acc: 0.9459512967729162
100%|██████████| 1000/1000 [00:11<00:00, 87.20it/s]
Epoch-12: loss: 0.17783702248180636; acc: 0.9459512967729162, val_loss: 0.7976406391558412, val_acc: 0.8046709129511678
100%|██████████| 1000/1000 [00:59<00:00, 16.88it/s]
Epoch-13: loss: 0.14918355378504203; acc: 0.9503068699267472
100%|██████████| 1000/1000 [00:11<00:00, 87.37it/s]
Epoch-13: loss: 0.14918355378504203; acc: 0.9503068699267472, val_loss: 0.8160117758086053, val_acc: 0.8118123914302259
100%|██████████| 1000/1000 [01:00<00:00, 16.60it/s]
Epoch-14: loss: 0.1313198971484453; acc: 0.962383686398733
100%|██████████| 1000/1000 [00:11<00:00, 88.59it/s]
Epoch-14: loss: 0.1313198971484453; acc: 0.962383686398733, val_loss: 0.8072414025840553, val_acc: 0.815479637135688
100%|██████████| 1000/1000 [00:59<00:00, 16.84it/s]
Epoch-15: loss: 0.11704726772185677; acc: 0.9633735893882399
100%|██████████| 1000/1000 [00:11<00:00, 85.19it/s]
Epoch-15: loss: 0.11704726772185677; acc: 0.9633735893882399, val_loss: 0.8006598546078457, val_acc: 0.8152866242038217
100%|██████████| 1000/1000 [00:59<00:00, 16.79it/s]
Epoch-16: loss: 0.09824711217604636; acc: 0.9742625222728173
100%|██████████| 1000/1000 [00:11<00:00, 88.90it/s]
Epoch-16: loss: 0.09824711217604636; acc: 0.9742625222728173, val_loss: 0.8438674144655451, val_acc: 0.8177957923180853
100%|██████████| 1000/1000 [00:59<00:00, 16.79it/s]
Epoch-17: loss: 0.0852395541092883; acc: 0.9750544446644229
100%|██████████| 1000/1000 [00:11<00:00, 88.00it/s]
Epoch-17: loss: 0.0852395541092883; acc: 0.9750544446644229, val_loss: 0.8193156978896586, val_acc: 0.8143215595444895
100%|██████████| 1000/1000 [00:59<00:00, 16.83it/s]
Epoch-18: loss: 0.07844399727860218; acc: 0.9784201148287468
100%|██████████| 1000/1000 [00:11<00:00, 88.51it/s]
Epoch-18: loss: 0.07844399727860218; acc: 0.9784201148287468, val_loss: 0.8360713951853868, val_acc: 0.8251302837290099
100%|██████████| 1000/1000 [00:59<00:00, 16.86it/s]
Epoch-19: loss: 0.07729116444718602; acc: 0.9788160760245496
100%|██████████| 1000/1000 [00:11<00:00, 85.81it/s]
Epoch-19: loss: 0.07729116444718602; acc: 0.9788160760245496, val_loss: 0.8346432074703807, val_acc: 0.8210770121598147
100%|██████████| 1000/1000 [00:59<00:00, 16.82it/s]
Epoch-20: loss: 0.06690881398832789; acc: 0.9798059790140566
100%|██████████| 1000/1000 [00:11<00:00, 88.31it/s]
Epoch-20: loss: 0.06690881398832789; acc: 0.9798059790140566, val_loss: 0.8577917314651385, val_acc: 0.8183748311136846
100%|██████████| 1000/1000 [00:59<00:00, 16.80it/s]
Epoch-21: loss: 0.05367027178355602; acc: 0.985349435755296
100%|██████████| 1000/1000 [00:11<00:00, 88.59it/s]
Epoch-21: loss: 0.05367027178355602; acc: 0.985349435755296, val_loss: 0.8853590850097044, val_acc: 0.8235861802740784
100%|██████████| 1000/1000 [00:59<00:00, 16.79it/s]
Epoch-22: loss: 0.04562243979116829; acc: 0.9879231835280142
100%|██████████| 1000/1000 [00:11<00:00, 88.02it/s]
Epoch-22: loss: 0.04562243979116829; acc: 0.9879231835280142, val_loss: 0.8947100080056626, val_acc: 0.8251302837290099
100%|██████████| 1000/1000 [00:59<00:00, 16.80it/s]
Epoch-23: loss: 0.04487149683444659; acc: 0.9875272223322115
100%|██████████| 1000/1000 [00:11<00:00, 83.96it/s]
Epoch-23: loss: 0.04487149683444659; acc: 0.9875272223322115, val_loss: 0.9012678128228526, val_acc: 0.8257093225246092
100%|██████████| 1000/1000 [00:59<00:00, 16.83it/s]
Epoch-24: loss: 0.043597472023681805; acc: 0.9869332805385073
100%|██████████| 1000/1000 [00:11<00:00, 88.64it/s]
Epoch-24: loss: 0.043597472023681805; acc: 0.9869332805385073, val_loss: 0.9172096501152982, val_acc: 0.8272534259795407
100%|██████████| 1000/1000 [00:59<00:00, 16.78it/s]
Epoch-25: loss: 0.039683220311884725; acc: 0.9879231835280142
100%|██████████| 1000/1000 [00:11<00:00, 89.10it/s]
Epoch-25: loss: 0.039683220311884725; acc: 0.9879231835280142, val_loss: 0.9694385796532524, val_acc: 0.8204979733642154
100%|██████████| 1000/1000 [00:59<00:00, 16.79it/s]
Epoch-26: loss: 0.03646317584309327; acc: 0.9899029895070283
100%|██████████| 1000/1000 [00:11<00:00, 88.90it/s]
Epoch-26: loss: 0.03646317584309327; acc: 0.9899029895070283, val_loss: 0.9861661683519687, val_acc: 0.8266743871839414
100%|██████████| 1000/1000 [00:59<00:00, 16.75it/s]
Epoch-27: loss: 0.03880691091310382; acc: 0.9893090477133241
100%|██████████| 1000/1000 [00:11<00:00, 84.44it/s]
Epoch-27: loss: 0.03880691091310382; acc: 0.9893090477133241, val_loss: 0.9058586812789617, val_acc: 0.8210770121598147
100%|██████████| 1000/1000 [00:59<00:00, 16.83it/s]
Epoch-28: loss: 0.037079696985995725; acc: 0.9887151059196199
100%|██████████| 1000/1000 [00:11<00:00, 87.97it/s]
Epoch-28: loss: 0.037079696985995725; acc: 0.9887151059196199, val_loss: 0.925899648670832, val_acc: 0.8272534259795407
100%|██████████| 1000/1000 [00:59<00:00, 16.94it/s]
Epoch-29: loss: 0.023646224649064154; acc: 0.9932686596713522
100%|██████████| 1000/1000 [00:11<00:00, 88.34it/s]
Epoch-29: loss: 0.023646224649064154; acc: 0.9932686596713522, val_loss: 0.9826786444298395, val_acc: 0.8230071414784791
100%|██████████| 1000/1000 [00:59<00:00, 16.94it/s]
Epoch-30: loss: 0.030056740869996247; acc: 0.9914868342902395
100%|██████████| 1000/1000 [00:11<00:00, 88.16it/s]
Epoch-30: loss: 0.030056740869996247; acc: 0.9914868342902395, val_loss: 1.0284815851225262, val_acc: 0.8245512449334106
Save Model¶
We can save the model for future use
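The saving cell isn't shown here; a minimal sketch using save_pretrained, assuming the ./my_model_dir folder seen in the output below, could be:

# save the fine-tuned classifier and the tokenizer to a local folder
clf.save_pretrained('./my_model_dir')
tokenizer.save_pretrained('./my_model_dir')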
('./my_model_dir/tokenizer_config.json',
'./my_model_dir/special_tokens_map.json',
'./my_model_dir/vocab.txt',
'./my_model_dir/added_tokens.json',
'./my_model_dir/tokenizer.json')
Using our NER tagger¶
We can use our NER tagger by loading a stack of documents and creating a dataset, as we did for the train & test subsets, or we can use it on a single sentence. I'll be using a simple inference function.
Loading Model¶
If we want to load the fine-tuned model, we can use .from_pretrained and specify the folder where the model is located; the tokenizer can be loaded in a similar way.
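The loading cell isn't shown; a sketch, assuming the model was saved to ./my_model_dir as above, could be:

# load the tokenizer and the fine-tuned classifier back from disk
# (relies on MODEL and CLASSES still being defined in this session)
tokenizer = AutoTokenizer.from_pretrained('./my_model_dir')
new_model = NERClassifier.from_pretrained('./my_model_dir')
new_model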
NERClassifier(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(120138, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
            (intermediate_act_fn): GELUActivation()
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
    (pooler): BertPooler(
      (dense): Linear(in_features=768, out_features=768, bias=True)
      (activation): Tanh()
    )
  )
  (seq): Sequential(
    (0): Linear(in_features=768, out_features=256, bias=True)
    (1): ReLU()
    (2): Linear(in_features=256, out_features=63, bias=True)
  )
)
Finding NER tags in a document¶
We can use our model with something similar to the code below, utilising the same tokenizer, the parser & the saved model.
def ner_inference(text, model):

    # tokenise input
    tokenizer_output = tokenizer(text,
                                 padding=True,
                                 truncation=True,
                                 return_tensors='pt')

    # token word identifier (each word can have multiple tokens)
    word_ids = tokenizer_output.word_ids()

    # for each word, how many subtokens are there (starts with 1 - first word)
    subword_group = [
        (key + 1, len(list(group)))
        for key, group in itertools.groupby(word_ids)
        if key is not None
    ]  # index to aggregate tokens

    # group all relevant data that will be used in forward pass
    output = {
        **tokenizer_output,
        "subword_group": torch.tensor(subword_group),
    }

    # take the tag with the highest logit value for each word
    tag_pred = torch.argmax(model(output), axis=1).tolist()
    return tag_pred
ner_inference('В девять утра я улетаю в Зимбабве',new_model)
The model predicts the following tags, which we can map using parser.id_to_tag
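Mapping the predicted ids back to their BIO labels is then a one-liner; a sketch (the exact output is not reproduced here):

# convert predicted tag ids into human-readable BIO labels
tag_ids = ner_inference('В девять утра я улетаю в Зимбабве', new_model)
print([parser.id_to_tag[i] for i in tag_ids])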
REFERENCES¶
1. https://www.mygreatlearning.com/blog/named-entity-recognition/