I had a question about the language model finetuning code on the Hugging Face repository. It seems that the forward method of the BERT model takes as input an argument called attention_mask.
The documentation says that the attention mask is an optional argument used when batching sequences together. This argument indicates to the model which tokens should be attended to, and which should not. For example, the tokenizer encoding methods return this attention mask, a binary tensor indicating the position of the padded indices so that the model does not attend to them, which makes sense.
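To make sure I understand the mechanism correctly, here is a minimal sketch (no transformers dependency, with made-up token ids and a hypothetical pad id of 0) of what padding a batch produces: input ids padded to equal length plus a binary attention mask marking real tokens (1) versus padding (0):

```python
PAD_ID = 0  # hypothetical pad token id, for illustration only

def pad_batch(sequences, pad_id=PAD_ID):
    """Pad variable-length id sequences and build the matching attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)      # pad to max length
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return input_ids, attention_mask

batch = [[101, 7592, 102], [101, 7592, 2088, 999, 102]]
ids, mask = pad_batch(batch)
print(ids)   # [[101, 7592, 102, 0, 0], [101, 7592, 2088, 999, 102]]
print(mask)  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```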
Unless I'm mistaken, however, this attention mask is never used in the language model finetuning code. During the forward pass, only the input ids are given to the model, cf this code.
My question is: does this mean that attention on the padding tokens is not masked out during training? Does it make sense to take them into account? Or have I missed something in the code?
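To be explicit about what I mean by "killing the attention": here is a toy sketch (plain Python, not the actual BERT implementation) of how a binary mask zeroes out attention weights, by pushing masked positions to minus infinity before the softmax so they receive exactly zero weight:

```python
import math

def masked_softmax(scores, mask):
    """Softmax over attention scores; positions with mask == 0 get zero weight."""
    # Replace masked positions with -inf so exp(-inf) == 0.0
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    top = max(masked)  # subtract max for numerical stability
    exps = [math.exp(s - top) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Last position is padding: it gets exactly 0 attention weight.
weights = masked_softmax([2.0, 1.0, 3.0, 0.5], [1, 1, 1, 0])
print(weights[-1])  # 0.0
```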
Thank you very much for your answer :)
EDIT
I noticed that the way Hugging Face builds the dataset means no padding is needed at all (see this code).