I had a question about the language model finetuning code on the Hugging Face repository. It seems that the forward method of the BERT model takes as input an argument called attention_mask.
The documentation says that the attention mask is an optional argument used when batching sequences together. This argument indicates to the model which tokens should be attended to, and which should not. For example, the tokenizer encoding methods return this attention mask, a binary tensor indicating the position of the padded indices so that the model does not attend to them, which makes sense.
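To make sure I understand the mechanism correctly, here is a minimal sketch (no transformers dependency, with made-up token ids and a hypothetical pad id of 0) of what padding a batch produces: input ids padded to equal length plus a binary attention mask marking real tokens (1) versus padding (0):

```python
PAD_ID = 0  # hypothetical pad token id, for illustration only

def pad_batch(sequences, pad_id=PAD_ID):
    """Pad variable-length id sequences and build the matching attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)      # pad to max length
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return input_ids, attention_mask

batch = [[101, 7592, 102], [101, 7592, 2088, 999, 102]]
ids, mask = pad_batch(batch)
print(ids)   # [[101, 7592, 102, 0, 0], [101, 7592, 2088, 999, 102]]
print(mask)  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```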
Unless I'm mistaken, however, this attention mask is never used in the language model finetuning code. During the forward pass, only the input ids are given to the model, cf this code.
My question is: does this mean that attention on the padding tokens is not masked out during training? Does it make sense to take them into account? Or have I missed something in the code?
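To be explicit about what I mean by "killing the attention": here is a toy sketch (plain Python, not the actual BERT implementation) of how a binary mask zeroes out attention weights, by pushing masked positions to minus infinity before the softmax so they receive exactly zero weight:

```python
import math

def masked_softmax(scores, mask):
    """Softmax over attention scores; positions with mask == 0 get zero weight."""
    # Replace masked positions with -inf so exp(-inf) == 0.0
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    top = max(masked)  # subtract max for numerical stability
    exps = [math.exp(s - top) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Last position is padding: it gets exactly 0 attention weight.
weights = masked_softmax([2.0, 1.0, 3.0, 0.5], [1, 1, 1, 0])
print(weights[-1])  # 0.0
```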
Thank you very much for your answer :)
EDIT
I noticed that the way Hugging Face builds the dataset means no padding is needed at all (see this code).