
I am attempting to update the pre-trained BERT model using an in-house corpus. I have looked at the Hugging Face Transformers docs and I am a little stuck, as you will see below. My goal is to compute simple similarities between sentences using cosine distance, but I need to update the pre-trained model for my specific use case.

The code below is taken more or less directly from the Hugging Face docs. I am attempting to "retrain" or update the model, and I assumed that SPECIAL_TOKEN_1 and SPECIAL_TOKEN_2 represent "new sentences" from my in-house data or corpus. Is this correct? In summary, I like the already pre-trained BERT model, but I would like to update or retrain it using another, in-house dataset. Any leads will be appreciated.

import tensorflow as tf
import tensorflow_datasets
from transformers import *

model = BertModel.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

SPECIAL_TOKEN_1 = "dogs are very cute"
SPECIAL_TOKEN_2 = "dogs are cute but i like cats better and my brother thinks they are more cute"

tokenizer.add_tokens([SPECIAL_TOKEN_1, SPECIAL_TOKEN_2])
model.resize_token_embeddings(len(tokenizer))
#Train our model
model.train()
model.eval()
Bram Vanroy
user8291021
  • You tagged the question with PyTorch, but your code imports TensorFlow. What framework are you planning to use? You also tagged the question with spacy, but I don't actually see where spacy is used. Can you clarify that? (If you are using PyTorch, I'll be glad to answer the question.) – Jindřich Oct 30 '19 at 09:18
  • Jindrich, I am not 100% sure, but as far as I know Hugging Face provides PyTorch APIs for using the new SOTA NLP models. – Ashwin Geet D'Sa Oct 30 '19 at 13:07
  • @user8291021, I am not sure how to do it using hugging face APIs, but if you want I can tell you how to finetune the pre-trained bert model using MLM on your custom data. – Ashwin Geet D'Sa Oct 30 '19 at 13:08
  • @Jindřich Sorry for the mislabeling - I should have included tensorflow in the tags. The docs are included at https://github.com/huggingface/transformers, and the main piece that's a bit unclear to me is tokenizer.add_tokens([SPECIAL_TOKEN_1, SPECIAL_TOKEN_2]). I assume the special tokens represent sentences or text from the new training data or "in house" corpus? – user8291021 Oct 30 '19 at 17:10
  • @AshwinGeetD'Sa can you show me how to fine-tune using MLM on my data? – Nauman Naeem Feb 06 '20 at 14:22
  • @NaumanNaeem, It is well explained here https://github.com/google-research/bert#pre-training-with-bert – Ashwin Geet D'Sa Feb 06 '20 at 16:08
  • @AshwinGeetD'Sa It is pre-training, not fine-tuning. – Nauman Naeem Feb 07 '20 at 08:56
  • Fine-tuning does not involve MLM. If you want to pre-train completely, you can start from scratch. Otherwise, you can run MLM on an existing pre-trained model (which I believe is what you are looking for). – Ashwin Geet D'Sa Feb 07 '20 at 09:21

1 Answer


BERT is pre-trained on two tasks: masked language modeling (MLM) and next sentence prediction (NSP). The more important of the two is MLM (it turns out that next sentence prediction is not really that helpful for the model's language understanding capabilities; RoBERTa, for example, is pre-trained only on MLM).

If you want to further train the model on your own dataset, you can do so by using BertForMaskedLM from the Transformers repository. This is BERT with a language modeling head on top, which allows you to perform masked language modeling (i.e. predicting masked tokens) on your own dataset. Here's how to use it:

from transformers import BertTokenizer, BertForMaskedLM
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased', return_dict=True)

# the input contains a [MASK] token; the labels are the token ids of the unmasked sentence
inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
labels = tokenizer("The capital of France is Paris.", return_tensors="pt")["input_ids"]

# the forward pass returns the MLM loss and the per-token logits
outputs = model(**inputs, labels=labels)
loss = outputs.loss
logits = outputs.logits
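
One caveat (my addition, not part of the original snippet): with labels set to the ids of the full sentence, the loss is computed over every position, not just the masked one. If you only want the loss on the masked token, and assuming the masked and unmasked sentences tokenize to the same length, a common trick is to set every other label position to -100, which the loss function ignores. Continuing from the snippet above:

# keep only the [MASK] position as a target; -100 is ignored by the MLM loss
labels = labels.masked_fill(inputs["input_ids"] != tokenizer.mask_token_id, -100)
outputs = model(**inputs, labels=labels)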

You can update the weights of BertForMaskedLM using loss.backward(), which is the main way of training PyTorch models. If you don't want to do this yourself, the Transformers library also provides a Python script which lets you perform MLM really quickly on your own dataset. See here (section "RoBERTa/BERT/DistilBERT and masked language modeling"). You just need to provide a training and a test file.
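
For illustration, a minimal hand-rolled training loop could look roughly like the sketch below, reusing the model and tokenizer from the snippet above. This is a simplified sketch only, not the official script: the corpus list, the learning rate, and the naive 15% masking (without BERT's 80/10/10 replacement rule and without batching) are placeholder assumptions.

import torch
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)
model.train()

corpus = ["dogs are very cute", "dogs are cute but i like cats better"]  # your in-house sentences (placeholder)

for epoch in range(3):
    for sentence in corpus:
        enc = tokenizer(sentence, return_tensors="pt")
        labels = enc["input_ids"].clone()
        # randomly mask roughly 15% of the non-special tokens, as in the BERT recipe
        special = torch.tensor(tokenizer.get_special_tokens_mask(labels[0].tolist(), already_has_special_tokens=True)).bool().unsqueeze(0)
        mask = torch.bernoulli(torch.full(labels.shape, 0.15)).bool() & ~special
        if not mask.any():
            continue  # nothing was masked this round; skip to avoid a NaN loss
        enc["input_ids"][mask] = tokenizer.mask_token_id
        labels[~mask] = -100  # only compute the loss on the masked positions
        outputs = model(**enc, labels=labels)
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()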

You don't need to add any special tokens. Examples of special tokens are [CLS] and [SEP], which are used for sequence classification and question answering tasks (among others). These are added by the tokenizer automatically. How do I know this? Because BertTokenizer inherits from PreTrainedTokenizer, and if you take a look at the documentation of its __call__ method here, you can see that the add_special_tokens parameter defaults to True.
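
As a quick sanity check (my own example, not from the original answer), you can see the tokenizer inserting [CLS] and [SEP] for you:

enc = tokenizer("dogs are very cute")
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# expected output, roughly: ['[CLS]', 'dogs', 'are', 'very', 'cute', '[SEP]']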

Niels