4

I'm trying to get sentence vectors from hidden states in a BERT model. I'm looking at the huggingface BertModel instructions here, which say:

from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
model = BertModel.from_pretrained("bert-base-multilingual-cased")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt') 
output = model(**encoded_input)

So first note: as it appears on the website, this does /not/ run. You get:

>>> Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'BertTokenizer' object is not callable

But it looks like a minor change fixes it: instead of calling the tokenizer directly, you ask it to encode the input:

encoded_input = tokenizer.encode(text, return_tensors="pt")
output = model(encoded_input)
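
Update: per the comment below, the TypeError apparently comes from an older transformers version; on a recent release the tokenizer object is callable, so the documented snippet runs as written. A minimal sketch of what I believe that looks like, with the output shape made explicit:

# Assumes a recent transformers release (where the tokenizer object is callable).
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertModel.from_pretrained("bert-base-multilingual-cased")

# Calling the tokenizer returns a dict with input_ids and attention_mask,
# which can be unpacked directly into the model.
encoded_input = tokenizer("Replace me by any text you'd like.", return_tensors="pt")
output = model(**encoded_input)

last_hidden_state = output[0]  # shape: [batch_size, num_tokens, hidden_size]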

OK, that aside: the tensors I get have a different shape than I expected:

>>> output[0].shape
torch.Size([1, 11, 768])

This is a lot of layers. Which is the correct layer to use for sentence embeddings? [0]? [-1]? Averaging several? My goal is to do cosine similarity with these, so I need a proper 1xN vector rather than an NxK tensor.

I see that the popular bert-as-service project appears to use [0].

Is this correct? Is there documentation for what each of the layers are?

Mittenchops
  • Regarding `TypeError: 'BertTokenizer' object is not callable` you probably have installed an older version of transformers. – cronoik Aug 18 '20 at 16:11

2 Answers

7

I don't think there is a single authoritative piece of documentation saying what to use and when. You need to experiment and measure what is best for your task. Recent observations about BERT are nicely summarized in this paper: https://arxiv.org/pdf/2002.12327.pdf.

I think the rule of thumb is:

  • Use the last layer if you are going to fine-tune the model for your specific task, and fine-tune whenever you can: several hundred or even a few dozen training examples are enough.

  • Use some of the middle layers (7th or 8th) if you cannot fine-tune the model. The intuition is that the layers first develop a more and more abstract and general representation of the input, and at some point the representation starts to become more targeted towards the pre-training task.

bert-as-service uses the last layer by default (but it is configurable). Here, that would be `[:, -1]`. However, it always returns a list of vectors for all input tokens. The vector corresponding to the first special (so-called [CLS]) token is considered to be the sentence embedding. This is where the [0] comes from in the snippet you refer to.
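
To make that concrete, here is a minimal sketch (assuming a recent transformers 4.x release) of how you could expose the per-layer hidden states and take the [CLS] vector from the last layer or from one of the middle layers:

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
# output_hidden_states=True makes the model return every layer, not just the last one.
model = BertModel.from_pretrained("bert-base-multilingual-cased", output_hidden_states=True)

encoded_input = tokenizer("Replace me by any text you'd like.", return_tensors="pt")
with torch.no_grad():
    output = model(**encoded_input)

# output.hidden_states is a tuple of 13 tensors: the embedding output followed by
# the 12 encoder layers, each of shape [batch, num_tokens, 768].
hidden_states = output.hidden_states

last_layer_cls = hidden_states[-1][:, 0, :]   # [CLS] vector from the last layer, shape [1, 768]
middle_layer_cls = hidden_states[8][:, 0, :]  # [CLS] vector from the 8th encoder layer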

Jindřich
  • Does it make sense to aggregate multiple layers, say the last and the second to last? Is a simple arithmetic mean appropriate for that operation or no? – Mittenchops Aug 18 '20 at 14:31
  • 1
    It certainly does. In some sense, the last layer contains all the previous layers, because the model is interconnected via residual connections, i.e., after each layer, the output of the layer is summed with the output of the previous one. Due to the residual connections, the layers are sort of commensurable, and averaging them just changes the ratio in which the layers were mixed previously (see the sketch after these comments). – Jindřich Aug 18 '20 at 16:35
  • Sorry, and the layers are ordered such that to get the /last/ 3 layers, that would be something like: `>>> output[0][:,-4:-1,:].shape`, giving `torch.Size([1, 3, 768])`. Right? – Mittenchops Aug 20 '20 at 03:15
  • 1
    Exactly. (Btw., instead of `-4:-1`, you can just write `-4:`.) – Jindřich Aug 20 '20 at 07:41
  • And sorry to revive an old question, but the layer subset is for sure the middle dimension of the output[0] object? This appears to vary depending on the document length. – Mittenchops Oct 05 '20 at 23:59
  • @Jindřich do you know how can I pass multiple texts instead of one. For example, instead of: text = "Replace me by any text you'd like", a list of texts such as text =["First text", "Second text"] – Alfredo_MF Nov 27 '20 at 06:28
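
To tie the comments above together, a minimal sketch (again assuming a recent transformers 4.x release): the middle dimension of output[0] counts tokens, which is why it changes with the document length, so averaging layers goes through the hidden_states tuple instead; and a list of texts can be passed straight to the tokenizer as a padded batch:

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertModel.from_pretrained("bert-base-multilingual-cased", output_hidden_states=True)

# Multiple texts are encoded as one padded batch.
encoded = tokenizer(["First text", "Second text"], padding=True, return_tensors="pt")
with torch.no_grad():
    output = model(**encoded)

# output[0] is the last hidden state: [batch, num_tokens, 768]; the per-layer states
# live in output.hidden_states (a tuple of 13 tensors of that same shape).
stacked = torch.stack(output.hidden_states[-4:], dim=0)  # last 4 layers: [4, batch, num_tokens, 768]
averaged_layers = stacked.mean(dim=0)                    # [batch, num_tokens, 768]
sentence_vectors = averaged_layers[:, 0, :]              # [CLS] position -> [batch, 768]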
5

While Jindřich's existing answer is generally correct, it does not address the question entirely. The OP asked which layer he should use to calculate the cosine similarity between sentence embeddings, and the short answer to this question is: none. A metric like cosine similarity requires that the dimensions of the vector contribute equally and meaningfully, but this is not the case for the BERT weights released by the original authors. Jacob Devlin (one of the authors of the BERT paper) wrote:

I'm not sure what these vectors are, since BERT does not generate meaningful sentence vectors. It seems that this is doing average pooling over the word tokens to get a sentence vector, but we never suggested that this will generate meaningful sentence representations. And even if they are decent representations when fed into a DNN trained for a downstream task, it doesn't mean that they will be meaningful in terms of cosine distance. (Since cosine distance is a linear space where all dimensions are weighted equally).

However, that does not mean you cannot use BERT for such a task. It just means that you cannot use the pre-trained weights out-of-the-box. You can either train a classifier on top of BERT which learns which sentences are similar (using the [CLS] token), or you can use sentence-transformers, which can be used in an unsupervised scenario because its models were trained to produce meaningful sentence representations.
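
As a rough sketch of the second option (the checkpoint name below is just one example of a pre-trained sentence-transformers model, not a specific recommendation):

# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

# Example checkpoint; any pre-trained sentence-transformers model works the same way.
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

sentences = ["Replace me by any text you'd like.", "Any text can replace me."]
embeddings = model.encode(sentences, convert_to_tensor=True)  # shape: [2, embedding_dim]

# Cosine similarity between the two sentence embeddings.
similarity = util.pytorch_cos_sim(embeddings[0], embeddings[1])
print(similarity.item())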

cronoik
  • sentence-transformers is still limited to sentences, right? It doesn't apply to multi-sentence documents without the same kind of failure BERT has when composing documents from words, does it? – Mittenchops Oct 07 '20 at 06:34
  • 1
    No, you can use it for whole paragraphs. @Mittenchops – cronoik Oct 07 '20 at 13:07
  • This is quite an interesting question. So, in order to look for similar sentences, you would not use the output from BERT embeddings and try to use cosine similarity, am I right? But what if the idea is not to look for similar sentences but to look for similar words? I retrieve the embedding of a word and try to look for similar embeddings in another sentence. – Borja_042 Jan 21 '21 at 10:11
  • 1
    @Borja_042 No, that is not what I said here. I said the original BERT weights released by Google were never intended to be used for finding similar sequences. You need weights for BERT that are trained for this task. This is what the sentence-transformers project does: they release weights that are trained for such an objective. Regarding your other question, are you looking for a way to determine the similarity of a word in the context of a sentence, or just for synonyms? – cronoik Jan 22 '21 at 14:17
  • @cronoik Thanks for your answer. When you say you need weights for BERT that are trained for this task, do you mean retraining a new BERT, or using something already pretrained from another place? My task now is to search for entities in plain text; to do so, I am making embeddings from the names of the fields I want to look for, and I use BERT as well to convert the plain text into vectors. Once I have those 2 vectors, I retrieve the words most similar to the fields I want to look for. I do not know if BERT and this method are a valid approach to this problem. Perhaps you can guide me a bit. Thanks a lot! – Borja_042 Jan 22 '21 at 15:32
  • @Borja_042 You don't need to train BERT from scratch. As written in the answer, you can either fine-tune BERT with a similarity task or use the weights provided by the sentence-transformers project. The other question is not really suited for Stack Overflow. Maybe you can post it in the [huggingface forum](https://discuss.huggingface.co/) with a small example. – cronoik Jan 22 '21 at 17:53