
Named Entity Recognition with Torch Loop

In this notebook, we'll take a look at how we can utilise Hugging Face to easily load and use BERT for token classification. Whilst we load both the base model & tokeniser from Hugging Face, we'll be using a custom Torch training loop and a custom tail on top of the model. This approach isn't the most straightforward, but it is one way we can do it. We'll be utilising the MASSIVE dataset by Amazon and fine-tuning the transformer encoder BERT.


Background

What is NER?

  • NER is a natural language processing technique which identifies and extracts named entities from unstructured text
  • Named entities refer to words or combinations of words that represent specific objects, places etc; in principle they can be anything we define them to be
  • NER algorithms use machine or deep learning to analyse text and recognise patterns that indicate the presence of a named entity (a toy example follows below)
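
As a toy, hand-labelled illustration (made up for this notebook, not taken from any dataset), NER assigns one label per word, marking entities with their type and everything else with O:

sentence = "wake me up at nine am on friday"
tags = ["O", "O", "O", "O", "B-TIME", "I-TIME", "O", "B-DATE"]

for word, tag in zip(sentence.split(), tags):
    print(f"{word:10} {tag}")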

Applications

Named Entity Recognition has a wide range of applications in the field of Natural Language Processing and Information Retrieval.

A few such examples are listed below [1]:

Classifying content for news providers:

A large amount of online content is generated by news and publishing houses on a daily basis, and managing it correctly can be a challenging task for human workers. Named Entity Recognition can automatically scan entire articles and help identify and retrieve the major people, organizations, and places discussed in them. Articles can thus be automatically categorized into defined hierarchies, and the content becomes much easier to discover.

Automatically Summarizing Resumes:

You might have come across tools that scan your resume and retrieve important information such as name, address, and qualifications. The majority of such tools use NER software to retrieve this information. One of the more challenging tasks faced by HR departments is evaluating a gigantic pile of resumes to shortlist candidates; many of these resumes are excessively detailed, and most of the information in them is irrelevant to the evaluator. Using an NER model, the information relevant to the evaluator can easily be retrieved, simplifying the effort required to shortlist candidates.

Optimizing Search Engine Algorithms:

When designing a search engine algorithm, it would be inefficient and computationally expensive to search for an entire query across the millions of articles and websites online. An alternative is to run an NER model over the articles once and store the entities associated with them permanently. For a quick and efficient search, the key tags in the search query can then be compared with the tags associated with the website articles.

Powering Recommendation systems:

NER can be used in developing algorithms for recommender systems that make suggestions based on our search history or on our present activity. This is achieved by extracting the entities associated with the content in our history or previous activity and comparing them with the labels assigned to other, unseen content. This way we are frequently shown content that matches our interests.

Simplifying Customer Support:

Usually, a company gets tons of customer complaints and feedback on a daily basis, and going through each one of them and recognizing the concerned parties is not an easy task. Using NER we can recognize relevant entities in customer complaints and feedback, such as product specifications, department, or company branch location, so that the feedback is classified accordingly and forwarded to the department responsible for the identified product.

The Dataset

To realise NER with Hugging Face, we'll be utilising the multi-language dataset MASSIVE.

Massive 1.1

Let's load our dataset MASSIVE; the dataset can be found on Hugging Face.

MASSIVE 1.1 is a parallel dataset of > 1M utterances across 52 languages with annotations for the Natural Language Understanding tasks of intent prediction and slot annotation. Utterances span 60 intents and include 55 slot types. MASSIVE was created by localizing the SLURP dataset, composed of general Intelligent Voice Assistant single-shot interactions.

We will be utilising this dataset for Named Entity Recognition. Let's load our dataset, selecting only the ru-RU subset of the data, i.e. rows whose locale column is ru-RU.

There are quite a number of ways we can extract data from this dataset; let's use the fastest way:

  • By specifying which locale group we want to load (so we don't load everything)
  • By defining the subset of rows we want to use with split (eg. train[10:100])

from datasets import load_dataset, Dataset

train_dataset = load_dataset('AmazonScience/massive', "ru-RU", split="train[:1000]")
test_dataset = load_dataset('AmazonScience/massive', "ru-RU", split="test[:1000]")

Relevant Columns

Some samples from our documents are shown below. The documents are located in column utt, and the NER annotations are located in annot_utt, in the format [tag : tokens]

train_dataset['utt'][:10] 
['разбуди меня в девять утра в пятницу',
'поставь будильник на два часа вперед',
'олли тихо',
'отстановись',
'олли остановись на десять секунд',
'остановись на десять секунд',
'сделай освещение здесь чуть более тёплым',
'пожалуйста сделай свет подходящий для чтения',
'время идти спать',
'олли время спать']
train_dataset['annot_utt'][:10]
['разбуди меня в [time : девять утра] в [date : пятницу]',
'поставь будильник [time : на два часа вперед]',
'олли тихо',
'отстановись',
'олли остановись на [time : десять секунд]',
'остановись на [time : десять секунд]',
'сделай освещение здесь чуть более [color_type : тёплым]',
'пожалуйста сделай свет [color_type : подходящий для чтения]',
'время идти спать',
'олли время спать']

Text Preprocessing

Define Tokeniser & Model

For preprocessing we'll need a tokeniser, so let's define the tokeniser & the base model

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("ai-forever/ruBert-base")
MODEL = AutoModel.from_pretrained("ai-forever/ruBert-base")
# tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")    # multilingual                      
# model = AutoModel.from_pretrained("bert-base-multilingual-cased")            # multilingual
# tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")               # en only
# model = AutoModel.from_pretrained("bert-base-uncased")                       # en only

NER format

NER tagging formats can vary quite a bit; let's utilise the class Parser to preprocess the NER format in our dataset

The class Parser contains the relevant preprocessing steps we will take:

  • The class instance will be called when creating our Dataset, creating NER tags for each document
  • Its inputs are the document & the annotated document
  • The document itself is used to create the O tags
  • Whilst the annotated document is used to extract the annotated tags
  • The tag dictionaries tag_to_id & id_to_tag are created along the way

from typing import List
import regex as re

'''

PARSER FOR THE DATASET NER TAG FORMAT

'''

class Parser:

    # RE patterns for tag extraction
    LABEL_PATTERN = r"\[(.*?)\]"
    PUNCTUATION_PATTERN = r"([.,\/#!$%\^&\*;:{}=\-_`~()'\"’¿])"

    # initialise, first word/id tag is O (outside)
    def __init__(self):
        self.tag_to_id = {
            "O": 0
        }
        self.id_to_tag = {
            0: "O"
        }

    '''

    CREATE TAGS

    '''
    # input : sentence, tagged sentence

    def __call__(self, sentence: str, annotated: str) -> List[str]:

        ''' Create Dictionary of Identified Tags'''

        # 1. set label B or I    

        matches = re.findall(self.LABEL_PATTERN, annotated)
        word_to_tag = {}
        for match in matches:
            tag, phrase = match.split(" : ")
            words = phrase.split(" ") 
            word_to_tag[words[0]] = f"B-{tag.upper()}"
            for w in words[1:]:
                word_to_tag[w] = f"I-{tag.upper()}"

    ''' Split Sentence into words & tag untagged words as O '''

        # 2. add token tag to main tag dictionary

        tags = []
        sentence = re.sub(self.PUNCTUATION_PATTERN, r" \1 ", sentence)
        for w in sentence.split():
            if w not in word_to_tag:
                tags.append("O")
            else:
                tags.append(word_to_tag[w])
                self.__add_tag(word_to_tag[w])

        return tags

    '''

    TAG CONVERSION

    '''
    # to word2id (tag_to_id)
    # to id2word (id_to_tag)

    def __add_tag(self, tag: str):
        if tag in self.tag_to_id:
            return
        id_ = len(self.tag_to_id)
        self.tag_to_id[tag] = id_
        self.id_to_tag[id_] = tag

    ''' Get Tag Number ID '''
    # or just number id for token

    def get_id(self, tag: str):
        return self.tag_to_id[tag]

    ''' Get Tag Token from Number ID'''
    # given id get its token

    def get_label(self, id_: int):
        return self.id_to_tag[id_]

parser = Parser()
parser(train_dataset["utt"][0], train_dataset["annot_utt"][0])
['O', 'O', 'O', 'B-TIME', 'I-TIME', 'O', 'B-DATE']
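
The call has also registered the new tags it encountered, so the parser's tag dictionaries should now look like this (IDs are assigned in order of first appearance):

parser.tag_to_id
{'O': 0, 'B-TIME': 1, 'I-TIME': 2, 'B-DATE': 3}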

Create Dataset

The class NERDataset is our Dataset class; it relies on the Parser class instance created above

The inputs into the parser object are utt & its annotated version annot_utt

The output from the parser will be a list of target BIO tags for each document:

['O', 'O', 'O', 'B-TIME', 'I-TIME', 'O', 'B-DATE']
['O', 'O', 'B-TIME', 'I-TIME', 'I-TIME', 'I-TIME']
['O', 'O']
['O']
['O', 'O', 'O', 'B-TIME', 'I-TIME']
['O', 'O', 'B-TIME', 'I-TIME']
['O', 'O', 'O', 'O', 'O', 'B-COLOR_TYPE']
['O', 'O', 'O', 'B-COLOR_TYPE', 'I-COLOR_TYPE', 'I-COLOR_TYPE']
['O', 'O', 'O']

For each item (document), the NERDataset output looks like the following:

tmp[idx] = {
    **tokenizer_output,                             # BERT related token inputs
    "subword_group": torch.tensor(subword_group),   # token/word association (start,number of tokens in word)
    "target": torch.tensor(target)                  # word NER tag
}

Now the class itself:

import torch
import itertools
from tqdm import tqdm
from torch.utils.data import Dataset, DataLoader

class NERDataset(Dataset):
    def __init__(self, dataset, tokenizer):
        self.tokenizer = tokenizer
        self.processed_data = self.__preprocess(dataset)

    def __len__(self):
        return len(self.processed_data)

    def __getitem__(self, idx):
        return self.processed_data[idx]

    def __preprocess(self, dataset):

        tmp = {}
        for idx in tqdm(range(len(dataset))):
            item = dataset[idx]
            tags = parser(item["utt"], item["annot_utt"])     # get list of tags
            tokenizer_output = self.tokenizer(item["utt"],    # tokenise document (incl. [CLS],[SEP])
                                              padding=True, 
                                              truncation=True, 
                                              return_tensors='pt')

            # token word identifier (each word can have multiple tokens)
            word_ids = tokenizer_output.word_ids() 

            # for each word: (index of its first token, number of subword tokens)
            # special tokens ([CLS]/[SEP]) have word id None and are skipped
            subword_group, start = [], 0
            for key, group in itertools.groupby(word_ids):
                n_tokens = len(list(group))
                if key is not None:
                    subword_group.append((start, n_tokens))
                start += n_tokens

            # define bio tags for each word in numerical format using parser
            target = [parser.get_id(t) for t in tags] 

            # group all relevant data that will be used in forward pass
            tmp[idx] = {
                **tokenizer_output,
                "subword_group": torch.tensor(subword_group),
                "target": torch.tensor(target)
            }

            # check consistency: one tag per word
            try:
                assert len(subword_group) == len(target)
            except AssertionError:
                print(item["annot_utt"], subword_group, target)

        return tmp

train = NERDataset(train_dataset, tokenizer)
test = NERDataset(test_dataset, tokenizer)
  0%|          | 0/1000 [00:00<?, ?it/s]Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
100%|██████████| 1000/1000 [00:00<00:00, 1690.12it/s]
100%|██████████| 1000/1000 [00:00<00:00, 1598.76it/s]
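
To make the subword_group bookkeeping concrete, here is a minimal standalone sketch (the word_ids list below is made up) of how word_ids() and itertools.groupby turn subword tokens into (first token index, number of tokens) pairs, one pair per word:

import itertools

# made-up word_ids() output: [CLS], a single-token word,
# a word split into three subword tokens, [SEP]
word_ids = [None, 0, 1, 1, 1, None]

subword_group, start = [], 0
for key, group in itertools.groupby(word_ids):
    n_tokens = len(list(group))
    if key is not None:                          # skip special tokens
        subword_group.append((start, n_tokens))
    start += n_tokens

print(subword_group)   # [(1, 1), (2, 3)]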

BERT Multiclass Classifier

We are taking a rather custom approach to NER: we have loaded our encoder model, so we'll need to create a tail end for it so that we can use it for NER multiclass classification

Model Architecture

Let's create a custom training & evaluation loop for our NER multiclass classification problem

  • In total we have 63 classification tags (including O), so we can define the tail output dimension
  • Passing input_ids & attention_mask (tokens) through the BERT model, we extract the hidden state for all tokens in the document
  • Using the created subword_group data, we extract the token embeddings that correspond to each word in the document
  • For words which have more than one token, we average the token embeddings to obtain word embeddings
  • The word embeddings are passed to the linear tail classifier, which returns the logits for each word in the sentence
  • To get the most probable tag for each word in the document, we use torch.argmax (a shape-level sketch of these last steps follows this list)
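
As a shape-level sketch of the last few steps, with dummy tensors and a single nn.Linear standing in for the two-layer tail we define below:

import torch
import torch.nn as nn

hidden = torch.randn(1, 12, 768)   # dummy last_hidden_state: (batch, tokens, hidden)
tail = nn.Linear(768, 63)          # stand-in for the classifier tail (63 tags)

start, n_tokens = 2, 3                                          # one word made of tokens 2,3,4
word_embedding = hidden[:, start:start + n_tokens].mean(dim=1)  # (1, 768) mean over subword tokens
logits = tail(word_embedding)                                   # (1, 63) logits over the tags
tag_id = torch.argmax(logits, dim=1)                            # most probable tag id for this word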

To utilise our custom model with methods such as .from_pretrained, let's subclass PreTrainedModel, which itself extends nn.Module

import torch.nn as nn
from transformers import PreTrainedModel
from transformers import PretrainedConfig
from transformers import AutoModel, AutoConfig

class MyConfig(PretrainedConfig):
    model_type = 'mymodel'
    def __init__(self, important_param=42, **kwargs):
        super().__init__(**kwargs)
        self.important_param = important_param

# PreTrainedModel subclasses nn.Module

class NERClassifier(PreTrainedModel):

    config_class = MyConfig
    def __init__(self,config):
        super().__init__(config)
        self.bert = MODEL
        self.seq = nn.Sequential(
            nn.Linear(768, 256), 
            nn.ReLU(),
            nn.Linear(256, CLASSES),
        )

    def forward(self, inputs): # returns logits for each word

        # standard inputs for BERT
        bert_output = self.bert(
            inputs["input_ids"],
            inputs["attention_mask"]
        )

        # output of transformer encoder will be our hidden state for 
        # each input_ids 
        last_hidden_state = bert_output["last_hidden_state"]

        # tokens follow the tokenizer's subword divisions;
        # each word can be split into multiple tokens, so we take
        # the mean of its token embeddings as the word embedding

        target = []
        for group in inputs["subword_group"]:
            b, e = group
            word_embedding = last_hidden_state[:, b:b+e]       # get the token embeddings
            agg_embedding = torch.mean(word_embedding, dim=1)  # mean word embeddings for tokens

            # pass the mean word embedding (1,768) through the nn.Sequential linear tail
            proba = self.__forward_one(agg_embedding)    # word logits
            target.append(proba)

        word_logits = torch.stack(target).squeeze(1)

        return word_logits

    def __forward_one(self, x):
        logits = self.seq(x)
        return logits

Model Additions

Aside from defining the classification model architecture on top of our base BERT model, we also need to:

  • Create the dataloaders
  • Instantiate our model (ie the tail weights need to be initialised)
  • Define the model evaluation criterion (loss function)
  • Define the relevant optimiser to update the weights

import torch.optim as optim

# use GPU if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# define data loaders
train_loader = DataLoader(train)
test_loader = DataLoader(test)
CLASSES = len(parser.tag_to_id); print(f'{CLASSES} labels')
config = MyConfig(4)

# define classifier model, loss function & optimiser
clf = NERClassifier(config).to(device)
criterion = nn.CrossEntropyLoss()  # for multiclass classification
optimizer = optim.Adam(clf.parameters(), lr=1e-5)

In total, we have created 63 tags in tag_to_id

63 labels

Train Model

  • Let's train our model for 30 epochs on train_loader & validate the model using test_loader
  • To check how well the model is performing during training, we will use the accuracy metric

from sklearn.metrics import accuracy_score

for epoch in range(30):

    '''

    (1) TRAINING LOOP

    '''

    loss_count, loss_sum = 0, 0
    y_true, y_pred = [], []

    # switch the model to training mode (enables dropout etc.)
    clf.train()
    for data in tqdm(train_loader):

        # move data to device
        inputs = {
            key: val.squeeze(0).to(device)
            for key, val in data.items()
        }

        # logits of belonging to each of the tag class 
        # for all words in document
        outputs = clf(inputs)

        # predicted word tag 
        word_tag = torch.argmax(outputs, dim=1).tolist() 

        y_true.extend(inputs["target"].tolist())
        y_pred.extend(word_tag)        

        # calculate loss
        loss = criterion(outputs, inputs["target"])
        loss_count += 1
        loss_sum += loss.item()

#         nn.utils.clip_grad_norm_(
#             parameters=clf.parameters(), max_norm=20
#         )

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print(f"Epoch-{epoch + 1}: loss: {loss_sum / loss_count}; acc: {accuracy_score(y_true, y_pred)}")

    '''

    (2) VALIDATION LOOP

    '''

    test_loss_sum, test_loss_count = 0, 0
    test_true, test_pred = [], []

    # switch to inference mode: eval() disables dropout, no_grad() disables gradient tracking
    clf.eval()
    with torch.no_grad():
        for test_rows in tqdm(test_loader):

            # move data to device
            test_inputs = {
                key: val.squeeze(0).to(device)
                for key, val in test_rows.items()
            }
            test_outputs = clf(test_inputs)

            # add metric data
            test_true.extend(test_inputs["target"].tolist())
            test_pred.extend(torch.argmax(test_outputs, dim=1).tolist())

            test_loss = criterion(test_outputs, test_inputs["target"])
            test_loss_count += 1
            test_loss_sum += test_loss.item()


    print(f"Epoch-{epoch + 1}: loss: {loss_sum / loss_count}; acc: {accuracy_score(y_true, y_pred)},\
          val_loss: {test_loss_sum / test_loss_count}, val_acc: {accuracy_score(test_true, test_pred)}")
100%|██████████| 1000/1000 [01:02<00:00, 16.09it/s]
Epoch-1: loss: 1.6441781679093839; acc: 0.7374777271827361
100%|██████████| 1000/1000 [00:11<00:00, 88.19it/s]
Epoch-1: loss: 1.6441781679093839; acc: 0.7374777271827361,          val_loss: 1.2796487891450525, val_acc: 0.7421347230264428
100%|██████████| 1000/1000 [01:00<00:00, 16.50it/s]
Epoch-2: loss: 1.1323998833782971; acc: 0.7539101168085528
100%|██████████| 1000/1000 [00:11<00:00, 87.29it/s]
Epoch-2: loss: 1.1323998833782971; acc: 0.7539101168085528,          val_loss: 1.065644938390702, val_acc: 0.7639451843273499
100%|██████████| 1000/1000 [01:01<00:00, 16.32it/s]
Epoch-3: loss: 0.8694944252893329; acc: 0.7895466244308058
100%|██████████| 1000/1000 [00:11<00:00, 85.78it/s]
Epoch-3: loss: 0.8694944252893329; acc: 0.7895466244308058,          val_loss: 0.9206145468372852, val_acc: 0.7815093611271955
100%|██████████| 1000/1000 [00:58<00:00, 16.95it/s]
Epoch-4: loss: 0.6839374071029015; acc: 0.8204315977034251
100%|██████████| 1000/1000 [00:11<00:00, 84.64it/s]
Epoch-4: loss: 0.6839374071029015; acc: 0.8204315977034251,          val_loss: 0.8384006800730712, val_acc: 0.7915460335842501
100%|██████████| 1000/1000 [00:59<00:00, 16.81it/s]
Epoch-5: loss: 0.5615229318903293; acc: 0.8445852306473965
100%|██████████| 1000/1000 [00:11<00:00, 87.55it/s]
Epoch-5: loss: 0.5615229318903293; acc: 0.8445852306473965,          val_loss: 0.7858886199912521, val_acc: 0.7979154603358425
100%|██████████| 1000/1000 [00:59<00:00, 16.78it/s]
Epoch-6: loss: 0.4753220743876882; acc: 0.8643832904375371
100%|██████████| 1000/1000 [00:11<00:00, 88.00it/s]
Epoch-6: loss: 0.4753220743876882; acc: 0.8643832904375371,          val_loss: 0.8105039822029648, val_acc: 0.7971434086083767
100%|██████████| 1000/1000 [00:59<00:00, 16.81it/s]
Epoch-7: loss: 0.3950597488232888; acc: 0.8774500098990299
100%|██████████| 1000/1000 [00:11<00:00, 84.88it/s]
Epoch-7: loss: 0.3950597488232888; acc: 0.8774500098990299,          val_loss: 0.7834225822311127, val_acc: 0.8062150164060992
100%|██████████| 1000/1000 [00:59<00:00, 16.88it/s]
Epoch-8: loss: 0.33086000567564045; acc: 0.8948723025143536
100%|██████████| 1000/1000 [00:11<00:00, 84.73it/s]
Epoch-8: loss: 0.33086000567564045; acc: 0.8948723025143536,          val_loss: 0.767055139105185, val_acc: 0.8096892491796951
100%|██████████| 1000/1000 [01:00<00:00, 16.50it/s]
Epoch-9: loss: 0.28477367215882987; acc: 0.9103147891506632
100%|██████████| 1000/1000 [00:11<00:00, 85.45it/s]
Epoch-9: loss: 0.28477367215882987; acc: 0.9103147891506632,          val_loss: 0.7778430652543611, val_acc: 0.8029337965643698
100%|██████████| 1000/1000 [00:59<00:00, 16.81it/s]
Epoch-10: loss: 0.23926631732314127; acc: 0.9227875668184518
100%|██████████| 1000/1000 [00:11<00:00, 88.43it/s]
Epoch-10: loss: 0.23926631732314127; acc: 0.9227875668184518,          val_loss: 0.7841836358316068, val_acc: 0.8120054043620922
100%|██████████| 1000/1000 [00:59<00:00, 16.82it/s]
Epoch-11: loss: 0.20171416073056753; acc: 0.9362502474757474
100%|██████████| 1000/1000 [00:11<00:00, 84.28it/s]
Epoch-11: loss: 0.20171416073056753; acc: 0.9362502474757474,          val_loss: 0.849082443419844, val_acc: 0.8104613009071607
100%|██████████| 1000/1000 [00:59<00:00, 16.83it/s]
Epoch-12: loss: 0.17783702248180636; acc: 0.9459512967729162
100%|██████████| 1000/1000 [00:11<00:00, 87.20it/s]
Epoch-12: loss: 0.17783702248180636; acc: 0.9459512967729162,          val_loss: 0.7976406391558412, val_acc: 0.8046709129511678
100%|██████████| 1000/1000 [00:59<00:00, 16.88it/s]
Epoch-13: loss: 0.14918355378504203; acc: 0.9503068699267472
100%|██████████| 1000/1000 [00:11<00:00, 87.37it/s]
Epoch-13: loss: 0.14918355378504203; acc: 0.9503068699267472,          val_loss: 0.8160117758086053, val_acc: 0.8118123914302259
100%|██████████| 1000/1000 [01:00<00:00, 16.60it/s]
Epoch-14: loss: 0.1313198971484453; acc: 0.962383686398733
100%|██████████| 1000/1000 [00:11<00:00, 88.59it/s]
Epoch-14: loss: 0.1313198971484453; acc: 0.962383686398733,          val_loss: 0.8072414025840553, val_acc: 0.815479637135688
100%|██████████| 1000/1000 [00:59<00:00, 16.84it/s]
Epoch-15: loss: 0.11704726772185677; acc: 0.9633735893882399
100%|██████████| 1000/1000 [00:11<00:00, 85.19it/s]
Epoch-15: loss: 0.11704726772185677; acc: 0.9633735893882399,          val_loss: 0.8006598546078457, val_acc: 0.8152866242038217
100%|██████████| 1000/1000 [00:59<00:00, 16.79it/s]
Epoch-16: loss: 0.09824711217604636; acc: 0.9742625222728173
100%|██████████| 1000/1000 [00:11<00:00, 88.90it/s]
Epoch-16: loss: 0.09824711217604636; acc: 0.9742625222728173,          val_loss: 0.8438674144655451, val_acc: 0.8177957923180853
100%|██████████| 1000/1000 [00:59<00:00, 16.79it/s]
Epoch-17: loss: 0.0852395541092883; acc: 0.9750544446644229
100%|██████████| 1000/1000 [00:11<00:00, 88.00it/s]
Epoch-17: loss: 0.0852395541092883; acc: 0.9750544446644229,          val_loss: 0.8193156978896586, val_acc: 0.8143215595444895
100%|██████████| 1000/1000 [00:59<00:00, 16.83it/s]
Epoch-18: loss: 0.07844399727860218; acc: 0.9784201148287468
100%|██████████| 1000/1000 [00:11<00:00, 88.51it/s]
Epoch-18: loss: 0.07844399727860218; acc: 0.9784201148287468,          val_loss: 0.8360713951853868, val_acc: 0.8251302837290099
100%|██████████| 1000/1000 [00:59<00:00, 16.86it/s]
Epoch-19: loss: 0.07729116444718602; acc: 0.9788160760245496
100%|██████████| 1000/1000 [00:11<00:00, 85.81it/s]
Epoch-19: loss: 0.07729116444718602; acc: 0.9788160760245496,          val_loss: 0.8346432074703807, val_acc: 0.8210770121598147
100%|██████████| 1000/1000 [00:59<00:00, 16.82it/s]
Epoch-20: loss: 0.06690881398832789; acc: 0.9798059790140566
100%|██████████| 1000/1000 [00:11<00:00, 88.31it/s]
Epoch-20: loss: 0.06690881398832789; acc: 0.9798059790140566,          val_loss: 0.8577917314651385, val_acc: 0.8183748311136846
100%|██████████| 1000/1000 [00:59<00:00, 16.80it/s]
Epoch-21: loss: 0.05367027178355602; acc: 0.985349435755296
100%|██████████| 1000/1000 [00:11<00:00, 88.59it/s]
Epoch-21: loss: 0.05367027178355602; acc: 0.985349435755296,          val_loss: 0.8853590850097044, val_acc: 0.8235861802740784
100%|██████████| 1000/1000 [00:59<00:00, 16.79it/s]
Epoch-22: loss: 0.04562243979116829; acc: 0.9879231835280142
100%|██████████| 1000/1000 [00:11<00:00, 88.02it/s]
Epoch-22: loss: 0.04562243979116829; acc: 0.9879231835280142,          val_loss: 0.8947100080056626, val_acc: 0.8251302837290099
100%|██████████| 1000/1000 [00:59<00:00, 16.80it/s]
Epoch-23: loss: 0.04487149683444659; acc: 0.9875272223322115
100%|██████████| 1000/1000 [00:11<00:00, 83.96it/s]
Epoch-23: loss: 0.04487149683444659; acc: 0.9875272223322115,          val_loss: 0.9012678128228526, val_acc: 0.8257093225246092
100%|██████████| 1000/1000 [00:59<00:00, 16.83it/s]
Epoch-24: loss: 0.043597472023681805; acc: 0.9869332805385073
100%|██████████| 1000/1000 [00:11<00:00, 88.64it/s]
Epoch-24: loss: 0.043597472023681805; acc: 0.9869332805385073,          val_loss: 0.9172096501152982, val_acc: 0.8272534259795407
100%|██████████| 1000/1000 [00:59<00:00, 16.78it/s]
Epoch-25: loss: 0.039683220311884725; acc: 0.9879231835280142
100%|██████████| 1000/1000 [00:11<00:00, 89.10it/s]
Epoch-25: loss: 0.039683220311884725; acc: 0.9879231835280142,          val_loss: 0.9694385796532524, val_acc: 0.8204979733642154
100%|██████████| 1000/1000 [00:59<00:00, 16.79it/s]
Epoch-26: loss: 0.03646317584309327; acc: 0.9899029895070283
100%|██████████| 1000/1000 [00:11<00:00, 88.90it/s]
Epoch-26: loss: 0.03646317584309327; acc: 0.9899029895070283,          val_loss: 0.9861661683519687, val_acc: 0.8266743871839414
100%|██████████| 1000/1000 [00:59<00:00, 16.75it/s]
Epoch-27: loss: 0.03880691091310382; acc: 0.9893090477133241
100%|██████████| 1000/1000 [00:11<00:00, 84.44it/s]
Epoch-27: loss: 0.03880691091310382; acc: 0.9893090477133241,          val_loss: 0.9058586812789617, val_acc: 0.8210770121598147
100%|██████████| 1000/1000 [00:59<00:00, 16.83it/s]
Epoch-28: loss: 0.037079696985995725; acc: 0.9887151059196199
100%|██████████| 1000/1000 [00:11<00:00, 87.97it/s]
Epoch-28: loss: 0.037079696985995725; acc: 0.9887151059196199,          val_loss: 0.925899648670832, val_acc: 0.8272534259795407
100%|██████████| 1000/1000 [00:59<00:00, 16.94it/s]
Epoch-29: loss: 0.023646224649064154; acc: 0.9932686596713522
100%|██████████| 1000/1000 [00:11<00:00, 88.34it/s]
Epoch-29: loss: 0.023646224649064154; acc: 0.9932686596713522,          val_loss: 0.9826786444298395, val_acc: 0.8230071414784791
100%|██████████| 1000/1000 [00:59<00:00, 16.94it/s]
Epoch-30: loss: 0.030056740869996247; acc: 0.9914868342902395
100%|██████████| 1000/1000 [00:11<00:00, 88.16it/s]
Epoch-30: loss: 0.030056740869996247; acc: 0.9914868342902395,          val_loss: 1.0284815851225262, val_acc: 0.8245512449334106

Save Model

We can save the model for future use

clf.save_pretrained('./my_model_dir')
tokenizer.save_pretrained('./my_model_dir')
('./my_model_dir/tokenizer_config.json',
 './my_model_dir/special_tokens_map.json',
 './my_model_dir/vocab.txt',
 './my_model_dir/added_tokens.json',
 './my_model_dir/tokenizer.json')

Using our NER tagger

We can use our NER tagger by loading a stack of documents and creating a dataset, as we did for the train & test subsets, or we can use it on a single sentence. Here I'll be using a simple inference function.

Loading Model

If we want to load the fine-tuned model, we can use .from_pretrained and specify the folder where the model is located; the tokenizer can be loaded in a similar way

new_model = NERClassifier.from_pretrained('./my_model_dir')
new_model
NERClassifier(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(120138, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
            (intermediate_act_fn): GELUActivation()
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
    (pooler): BertPooler(
      (dense): Linear(in_features=768, out_features=768, bias=True)
      (activation): Tanh()
    )
  )
  (seq): Sequential(
    (0): Linear(in_features=768, out_features=256, bias=True)
    (1): ReLU()
    (2): Linear(in_features=256, out_features=63, bias=True)
  )
)

Finding NER tags in a document

We can use our model with something similar to the code below, utilising the same tokenizer, the parser & the saved model

def ner_inference(text,model):

    # Tokenise input
    tokenizer_output = tokenizer(text,    
                                 padding=True, 
                                 truncation=True, 
                                 return_tensors='pt')

    # token word identifier (each word can have multiple tokens)
    word_ids = tokenizer_output.word_ids() 

    # for each word: (index of its first token, number of subword tokens)
    # special tokens ([CLS]/[SEP]) have word id None and are skipped
    subword_group, start = [], 0
    for key, group in itertools.groupby(word_ids):
        n_tokens = len(list(group))
        if key is not None:
            subword_group.append((start, n_tokens))
        start += n_tokens

    # group all relevant data that will be used in forward pass 
    output = {
        **tokenizer_output,
        "subword_group": torch.tensor(subword_group),
    }

    # get the most probable tag id for each word
    with torch.no_grad():
        tag_pred = torch.argmax(model(output), dim=1).tolist()

    return tag_pred

ner_inference('В девять утра я улетаю в Зимбабве',new_model)

The model predicts the following tag IDs, which we can map back to labels using parser.id_to_tag, as shown below:

[0, 2, 0, 0, 0, 0, 0]
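
For readability we can zip the words with the predicted IDs and look them up in parser.id_to_tag (this assumes the parser built during preprocessing is still in memory):

text = 'В девять утра я улетаю в Зимбабве'
tag_pred = ner_inference(text, new_model)

for word, tag_id in zip(text.split(), tag_pred):
    print(word, parser.id_to_tag[tag_id])   # id 2 should map to a time-related tag, the rest to O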

References


  1. https://www.mygreatlearning.com/blog/named-entity-recognition/#