
It is relatively easy to get a token's probability according to a language model, as the snippet below shows. You take the model's output, restrict yourself to the output at the masked position, and then look up the probability of the requested token in that output vector. However, this only works for single-token words, i.e. words that are themselves in the tokenizer's vocabulary. When a word does not exist in the vocabulary, the tokenizer chunks it up into pieces that it does know (see the bottom of the example). But since the input sentence contains only one masked position, and the requested word consists of more tokens than that, how can we get its probability? Ultimately I am looking for a solution that works regardless of the number of subword units a word has.

In the code below I have added comments explaining what is going on, and I have included the output of the print statements as comments. You'll see that predicting tokens such as 'love' and 'hate' is straightforward because they are in the tokenizer's vocabulary. 'reprimand' is not, though, so it cannot be predicted in a single masked position: it consists of three subword units. So how can we predict 'reprimand' in the masked position?

from transformers import BertTokenizer, BertForMaskedLM
import torch

# init model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()
# init softmax to get probabilities later on
sm = torch.nn.Softmax(dim=0)
torch.set_grad_enabled(False)

# set sentence with MASK token, convert to token_ids
sentence = f"I {tokenizer.mask_token} you"
token_ids = tokenizer.encode(sentence, return_tensors='pt')
print(token_ids)
# tensor([[ 101, 1045,  103, 2017,  102]])
# get the position of the masked token
masked_position = (token_ids.squeeze() == tokenizer.mask_token_id).nonzero().item()

# forward pass
output = model(token_ids)
# output[0] holds the prediction scores over the vocabulary
# for every position in the input
last_hidden_state = output[0].squeeze(0)
# only keep the scores for the masked position;
# this is a vector the size of the vocabulary
mask_hidden_state = last_hidden_state[masked_position]
# convert to probabilities (softmax),
# giving a probability for each item in the vocabulary
probs = sm(mask_hidden_state)

# get probability of token 'hate'
hate_id = tokenizer.convert_tokens_to_ids('hate')
print('hate probability', probs[hate_id].item())
# hate probability 0.008057191967964172

# get probability of token 'love'
love_id = tokenizer.convert_tokens_to_ids('love')
print('love probability', probs[love_id].item())
# love probability 0.6704086065292358

# get probability of token 'reprimand' (?)
reprimand_id = tokenizer.convert_tokens_to_ids('reprimand')
# reprimand is not in the vocabulary, so it needs to be split into subword units
print(tokenizer.convert_ids_to_tokens(reprimand_id))
# [UNK]

reprimand_id = tokenizer.encode('reprimand', add_special_tokens=False)
print(tokenizer.convert_ids_to_tokens(reprimand_id))
# ['rep', '##rim', '##and']
# but how do we now get the probability of a multi-token word in a single-token position?
Bram Vanroy

2 Answers


Since the split word is not present in the vocabulary, BERT simply has no notion of its probability, so there is no use in masking it before tokenization.

And you can't get its probability by exploiting the chain rule either; see the response by J. Devlin. To illustrate, let's take a more general example. Try to estimate the probability of some bigram at position i. While you can estimate the probability of each word given the rest of the sentence and its position,

P(w_i | w_0, w_1, ..., w_i-1, w_i+1, ..., w_N),

P(w_i+1 | w_0, w_1, ..., w_i, w_i+2, ..., w_N),

there is no way to get the probability of the bigram

P(w_i, w_i+1 | w_0, w_1, ..., w_i-1, w_i+2, ..., w_N)

because BERT does not store such information.

Having said all that, you can get a very rough estimate of the probability of your OOV word by multiplying the probabilities of seeing its parts. So you will get

P("reprimand"|...) ~= P("rep"|...)*P("##rim"|...)*P("##and"|...)

Since your subwords are not regular words but a special kind of token, this is not entirely wrong, because the dependency between them is implicit.
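One possible way to compute that rough estimate in code, as a sketch only: mask one position per subword unit and multiply the per-position probabilities of the pieces from a single forward pass. This reuses the tokenizer, model and torch setup from the question, and it treats the masked positions as independent, which is exactly why the estimate is rough.

# rough estimate: one [MASK] per subword piece, probabilities multiplied
# (approximation: the masked positions are predicted independently)
subword_ids = tokenizer.encode('reprimand', add_special_tokens=False)
# -> the ids of ['rep', '##rim', '##and']
masked_sentence = f"I {' '.join([tokenizer.mask_token] * len(subword_ids))} you"
input_ids = tokenizer.encode(masked_sentence, return_tensors='pt')
masked_positions = (input_ids.squeeze() == tokenizer.mask_token_id).nonzero().squeeze(-1)

# logits over the vocabulary for every position, softmaxed per position
logits = model(input_ids)[0].squeeze(0)
position_probs = torch.softmax(logits, dim=-1)

# multiply the probability of each subword piece at its own masked position
word_prob = 1.0
for pos, sub_id in zip(masked_positions.tolist(), subword_ids):
    word_prob *= position_probs[pos, sub_id].item()
print('rough probability of "reprimand"', word_prob)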

igrinis
  • What do you mean by 'BERT does not store such information'? I don't quite understand why BERT cannot model bigrams. Are there other language models that can? – Bram Vanroy Dec 30 '19 at 11:06
  • 1) The probability of a bigram is `P(w1,w2)=P(w1)P(w2|w1)!=P(w1)*P(w2)`. BERT does not store conditional probabilities of each word. BERT is **not a language model** in the traditional sense. BERT can't provide the probability of a specific sentence. 2) You can use (for example) an n-gram language model to get a bigram probability. But no matter what model you take, you will have issues when modelling rare words. Because of their nature, it is hard (you need lots of data) to estimate their conditional probabilities. So usually all rare words get some minimum probability, and it works fine. – igrinis Dec 30 '19 at 13:05

Instead of sentence = f"I {tokenizer.mask_token} you", predict on "I [MASK] [MASK] you" and "I [MASK] [MASK] [MASK] you" and filter the results, dropping chains of whole-word tokens so that you keep only chains of suitable subword pieces. Of course you're going to get better results if you provide more than two surrounding context words.
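To make that concrete, here is one possible sketch of the multi-mask filtering, reusing the tokenizer, model and torch setup from the question; the top-k cutoff and the brute-force enumeration of piece combinations are just illustrative choices, not the only way to do this:

from itertools import product

num_masks = 3  # try 2, 3, ... and compare the results
multi_mask = f"I {' '.join([tokenizer.mask_token] * num_masks)} you"
input_ids = tokenizer.encode(multi_mask, return_tensors='pt')
masked_positions = (input_ids.squeeze() == tokenizer.mask_token_id).nonzero().squeeze(-1)

probs = torch.softmax(model(input_ids)[0].squeeze(0), dim=-1)

# keep the k most likely pieces per masked position
k = 10
top_probs, top_ids = probs[masked_positions].topk(k, dim=-1)

# enumerate combinations and keep only chains that glue together into one word:
# the first piece must be a word start, the rest must be '##' continuations
candidates = []
for combo in product(range(k), repeat=num_masks):
    pieces = [tokenizer.convert_ids_to_tokens(top_ids[i, j].item())
              for i, j in enumerate(combo)]
    if pieces[0].startswith('##') or not all(p.startswith('##') for p in pieces[1:]):
        continue
    score = 1.0
    for i, j in enumerate(combo):
        score *= top_probs[i, j].item()
    candidates.append((score, pieces[0] + ''.join(p[2:] for p in pieces[1:])))

# print the ten best-scoring multi-piece words
for score, word in sorted(candidates, reverse=True)[:10]:
    print(word, score)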

But before you embark on that, reconsider your softmax. With dimension=0, it does a softmax calculation across all the token columns and all the token rows, not just the single token for which you want the softmax probability:

In [1]: import torch                                                                                                                      
In [2]: m = torch.nn.Softmax(dim=1) 
   ...: input = torch.randn(2, 3) 
   ...: input                                                                                                                        
Out[2]: 
tensor([[ 1.5542,  0.3776, -0.8047],
        [-0.3856,  1.1327, -0.1252]])

In [3]: m(input)                                                                                                                          
Out[3]: 
tensor([[0.7128, 0.2198, 0.0674],
        [0.1457, 0.6652, 0.1891]])

In [4]: soft = torch.nn.Softmax(dim=0) 
   ...: soft(input)                                                                                                                       
Out[4]: 
tensor([[0.8743, 0.3197, 0.3364],
        [0.1257, 0.6803, 0.6636]])
Todd Cook