I am trying BertForSequenceClassification for a simple article classification task.

No matter how I train it (all layers frozen except the classification layer, all layers trainable, or only the last k layers trainable), I always get near-random accuracy. The model never goes above 24-26% training accuracy (my dataset has only 5 classes).

I'm not sure what I did wrong while designing/training the model. I tried the model with multiple datasets, and every time it gives the same random-baseline accuracy.

Dataset I used: BBC Articles (5 classes)


Consists of 2225 documents from the BBC news website corresponding to stories in five topical areas from 2004-2005. Natural Classes: 5 (business, entertainment, politics, sport, tech)

I've included the model and training code below, since those are the most important parts (to avoid irrelevant details). The full source code and data are linked as well, in case that's useful for reproducibility.

My guess is that there is something wrong with the way I designed the network or the way I'm passing the attention_masks/labels to the model. Also, the 512-token limit should not be a problem, as most of the texts are shorter than 512 tokens (the mean length is < 300).

Model code:

import torch
from torch import nn
from transformers import BertForSequenceClassification

class BertClassifier(nn.Module):
    def __init__(self):
        super(BertClassifier, self).__init__()
        # num_labels = 5, as we have 5 classes
        self.bert = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels = 5)

        # we want our output as probability so, in the evaluation mode, we'll pass the logits to a softmax layer
        self.softmax = torch.nn.Softmax(dim = 1) # last dimension

    def forward(self, x, attn_mask = None, labels = None):
        if self.training:
            # with labels, the model returns the loss first, then the logits
            # (newer transformers versions return an output object by default;
            # passing return_dict = False gives a tuple again)
            outputs = self.bert(x, attention_mask = attn_mask, labels = labels)
            return outputs

        # in evaluation mode: softmax over the logits
        x = self.bert(x, attention_mask = attn_mask)
        x = self.softmax(x[0])
        return x

    def freeze_layers(self, last_trainable = 1):
        # freeze every parameter tensor except the last `last_trainable` ones
        # (this counts tensors, not layers: 1 leaves only the classifier bias trainable)
        for layer in list(self.bert.parameters())[:-last_trainable]:
            layer.requires_grad = False

# create our model

bertclassifier = BertClassifier()
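
One detail of freeze_layers that may be worth checking: the slice counts parameter tensors, not layers. The plain-Python sketch below uses an abbreviated, illustrative list of names (an assumption standing in for what `named_parameters()` yields, with the classification head's weight and bias last). Under that assumption, `last_trainable = 1` leaves only the classifier bias trainable. `trainable_by_prefix` is a hypothetical helper shown as one possible alternative, not part of transformers.

```python
# Abbreviated, illustrative tail of the parameter names (assumed order:
# encoder blocks, then the pooler, then the classification head).
param_names = [
    "bert.encoder.layer.11.output.dense.weight",
    "bert.encoder.layer.11.output.dense.bias",
    "bert.pooler.dense.weight",
    "bert.pooler.dense.bias",
    "classifier.weight",
    "classifier.bias",
]

def trainable_by_slice(names, last_trainable):
    """Mirror freeze_layers: everything before the last N tensors is frozen."""
    frozen = set(names[:-last_trainable])
    return [n for n in names if n not in frozen]

def trainable_by_prefix(names, prefixes):
    """Hypothetical alternative: keep tensors whose name starts with a prefix."""
    return [n for n in names if n.startswith(tuple(prefixes))]

print(trainable_by_slice(param_names, 1))   # only the classifier bias
print(trainable_by_slice(param_names, 2))   # classifier weight and bias
print(trainable_by_prefix(param_names, ["classifier."]))
```

With the real model, the same idea could be applied by looping over `self.bert.named_parameters()` and setting `requires_grad` based on the name.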

Training code:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu") # cuda for gpu acceleration

# optimizer

optimizer = torch.optim.Adam(bertclassifier.parameters(), lr=0.001)

epochs = 15

bertclassifier.to(device) # taking the model to GPU if possible

# metrics

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

train_losses = []

train_metrics = {'acc': [], 'f1': []}
test_metrics = {'acc': [], 'f1': []}

# progress bar

from tqdm import tqdm_notebook

for e in tqdm_notebook(range(epochs)):
    train_loss = 0.0
    train_acc = 0.0
    train_f1 = 0.0
    batch_cnt = 0

    bertclassifier.train() # training mode: forward returns (loss, logits)

    print(f'epoch: {e+1}')

    for i_batch, (X, X_mask, y) in tqdm_notebook(enumerate(bbc_dataloader_train)):
        X = X.to(device)
        X_mask = X_mask.to(device)
        y = y.to(device)

        optimizer.zero_grad() # clear gradients from the previous batch

        loss, y_pred = bertclassifier(X, X_mask, y)

        train_loss += loss.item()

        loss.backward() # backpropagate
        optimizer.step() # update the trainable weights

        y_pred = torch.argmax(y_pred, dim = -1)

        # update metrics
        train_acc += accuracy_score(y.cpu().detach().numpy(), y_pred.cpu().detach().numpy())
        train_f1 += f1_score(y.cpu().detach().numpy(), y_pred.cpu().detach().numpy(), average = 'micro')
        batch_cnt += 1

    print(f'train loss: {train_loss/batch_cnt}')

    test_loss = 0.0
    test_acc = 0.0
    test_f1 = 0.0
    batch_cnt = 0

    bertclassifier.eval() # evaluation mode: forward returns softmax probabilities

    with torch.no_grad():
        for i_batch, (X, y) in enumerate(bbc_dataloader_test):
            X = X.to(device)
            y = y.to(device)

            y_pred = bertclassifier(X) # in eval mode we get the softmax output, so no need to index

            y_pred = torch.argmax(y_pred, dim = -1)

            # update metrics
            test_acc += accuracy_score(y.cpu().detach().numpy(), y_pred.cpu().detach().numpy())
            test_f1 += f1_score(y.cpu().detach().numpy(), y_pred.cpu().detach().numpy(), average = 'micro')
            batch_cnt += 1
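
A side note on the metrics: for single-label multi-class predictions, micro-averaged F1 equals accuracy, so `train_f1` and `train_acc` in the loop above track the same number. The pure-Python sketch below illustrates this; the helper names and sample labels are made up for illustration, and sklearn is not used.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def micro_f1(y_true, y_pred):
    """Micro-averaged F1 with counts pooled over all classes."""
    tp = sum(t == p for t, p in zip(y_true, y_pred))
    # Each wrong prediction is one false positive (for the predicted class)
    # and one false negative (for the true class).
    fp = fn = len(y_true) - tp
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = [4, 2, 0, 3]   # illustrative labels
y_pred = [0, 2, 0, 0]   # illustrative predictions
print(accuracy(y_true, y_pred), micro_f1(y_true, y_pred))
```

Macro averaging (`average = 'macro'`) would give a separate signal, since it weights each class equally.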


Full source-code with the dataset is available here: https://github.com/zabir-nabil/pytorch-nlp/blob/master/bert-article-classification.ipynb


After observing the predictions, it seems the model almost always predicts 0:

with torch.no_grad():
    for i_batch, (X, y) in enumerate(bbc_dataloader_test):
        X = X.to(device)
        y = y.to(device)

        y_pred = bertclassifier(X) # in eval mode we get the softmax output, so no need to index

        y_pred = torch.argmax(y_pred, dim = -1)

        print(y) # true labels
        print(y_pred) # predicted labels

tensor([4, 2, 2, 3], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([3, 0, 3, 1], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([0, 0, 0, 2], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([3, 4, 4, 3], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([4, 3, 2, 0], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([0, 3, 3, 1], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([1, 1, 4, 3], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([0, 0, 0, 1], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([3, 3, 1, 3], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([3, 2, 4, 1], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([3, 3, 1, 1], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([3, 0, 1, 3], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([1, 0, 1, 0], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([4, 3, 1, 0], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([2, 2, 0, 4], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([3, 1, 2, 2], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([3, 4, 3, 3], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([1, 3, 0, 4], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([3, 3, 0, 1], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([2, 3, 2, 4], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([3, 3, 1, 2], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([1, 2, 3, 0], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([4, 3, 3, 0], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([2, 4, 2, 4], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([2, 4, 4, 4], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([2, 1, 3, 2], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([3, 3, 2, 1], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([3, 0, 0, 1], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([4, 1, 4, 4], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([3, 4, 3, 2], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([1, 2, 1, 3], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([0, 3, 3, 0], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([1, 4, 0, 4], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([0, 1, 1, 4], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([4, 2, 4, 4], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([0, 3, 0, 4], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([0, 2, 3, 4], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([0, 3, 0, 3], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([0, 3, 1, 3], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([1, 2, 2, 1], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([1, 3, 2, 3], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([2, 3, 2, 4], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([1, 3, 0, 0], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([0, 1, 3, 0], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([0, 4, 0, 3], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([1, 3, 0, 4], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([4, 3, 3, 0], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([3, 2, 0, 3], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([0, 0, 0, 3], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([2, 0, 2, 0], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([2, 2, 3, 3], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([0, 2, 3, 2], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([2, 3, 0, 2], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([2, 0, 0, 0], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([3, 0, 2, 2], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([0, 4, 3, 0], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([4, 0, 4, 2], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([3, 0, 3, 4], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([4, 2, 0, 1], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([3, 3, 1, 0], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([3, 1, 3, 1], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([1, 3, 3, 0], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([2, 3, 0, 3], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([3, 2, 3, 4], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([2, 0, 0, 0], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([4, 0, 3, 3], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([0, 1, 1, 0], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([1, 1, 0, 4], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([1, 4, 1, 2], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([0, 3, 2, 3], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([1, 3, 4, 1], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([3, 0, 4, 0], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([1, 1, 3, 3], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([4, 4, 3, 1], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([2, 0, 3, 2], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([0, 3, 3, 4], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([4, 0, 3, 4], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([0, 0, 1, 2], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([1, 2, 3, 3], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([2, 0, 4, 2], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([4, 2, 4, 0], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
tensor([0, 0, 3, 3], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')

Actually, the model always predicts the same output, [0.2270, 0.1855, 0.2131, 0.1877, 0.1867], for any input. It's as if it didn't learn anything at all.

It's weird because my dataset is not imbalanced.

Counter({'politics': 417,
         'business': 510,
         'entertainment': 386,
         'tech': 401,
         'sport': 511})
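
As a quick sanity check on those counts (a sketch using only the numbers above), a model that always predicts one class scores that class's share of the data. On the full dataset that is at most about 23%, so accuracy in the mid-20s is roughly what a constant prediction would give. The exact figure depends on the train/test split, which is not modelled here.

```python
from collections import Counter

counts = Counter({'politics': 417, 'business': 510, 'entertainment': 386,
                  'tech': 401, 'sport': 511})
total = sum(counts.values())

# Accuracy of a classifier that always predicts a single class.
constant_acc = {label: n / total for label, n in counts.items()}

for label, acc in sorted(constant_acc.items(), key=lambda kv: -kv[1]):
    print(f'{label:>13}: {acc:.1%}')
print(f'uniform guess: {1 / len(counts):.1%}')
```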
Zabir Al Nazi
  • Please describe the dataset and samples you are using in your question as well, to maintain the requirements of a [mcve] for future reference. – dennlinger May 23 '20 at 10:58
  • I attached the `github` link with the full code and the dataset, and I clearly mentioned it. It's an article classification task, as I mentioned, so it's just plain English text data. About reproducibility: unfortunately it's not practical to add the full code here (technically it is possible, but I would add too many irrelevant parts). I assume the only important parts of the training scheme are the model and the training block, and the full (reproducible) code is already on GitHub. – Zabir Al Nazi May 23 '20 at 11:05
  • I understand, and please take this as a mere suggestion. The idea is that it is unclear whether the code will still be available in 10 years' time, but the comment basically mentions everything important. That being said, are you getting better results with "baseline" models? And could the input limitation of transformer models to 512 tokens be a problem in your case? – dennlinger May 23 '20 at 11:06
  • I tried a bidirectional LSTM (Keras), which got much better accuracy, so I think there's something wrong with the way I designed the model or the way I'm passing attention_masks/labels. I don't think the 512-token limit is the problem: most of the texts are shorter than 512 tokens, and the mean is also below that. – Zabir Al Nazi May 23 '20 at 11:11

1 Answer


After some digging, I found that the main culprit was the learning rate: 0.001 is extremely high for fine-tuning BERT. When I reduced the learning rate from 0.001 to 1e-5, both training and test accuracy reached 95%.
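
One way to see why 0.001 is large here, sketched under simplifying assumptions: with Adam, the size of each parameter update is roughly the learning rate itself, almost independent of the gradient's scale. The scalar re-implementation below is a toy for illustration only (it is not the PyTorch code, `adam_updates` is a made-up name, and the constant gradient is an assumption). It suggests that 100 steps in a consistent direction move a weight by about 0.1 at lr = 1e-3, but only about 0.001 at lr = 1e-5.

```python
import math

def adam_updates(grads, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the updates a scalar Adam optimizer applies for a gradient sequence."""
    m = v = 0.0
    updates = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g        # running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g    # running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        updates.append(-lr * m_hat / (math.sqrt(v_hat) + eps))
    return updates

for lr in (1e-3, 1e-5):
    for g in (0.01, 100.0):
        updates = adam_updates([g] * 100, lr)
        print(f'lr={lr:g} grad={g:g} first step={abs(updates[0]):.2e} '
              f'total drift={abs(sum(updates)):.2e}')
```

In the training code from the question, the corresponding change would be `torch.optim.Adam(bertclassifier.parameters(), lr=1e-5)`, matching the value reported above.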

When BERT is fine-tuned, all layers are trained - this is quite different from fine-tuning in a lot of other ML models, but it matches what was described in the paper and works quite well (as long as you only fine-tune for a few epochs - it's very easy to overfit if you fine-tune the whole model for a long time on a small amount of data!)

src: https://github.com/huggingface/transformers/issues/587

Best result is found when all the layers are trained with a really small learning rate.

src: https://github.com/uzaymacar/comparatively-finetuning-bert

Zabir Al Nazi
  • This is something I've noticed with the PyTorch transformer examples as well. They use 1e-3 as the default learning rate but I found that I have to use at least 1e-4 or it doesn't learn – David Waterworth Sep 08 '20 at 01:08
  • This is also noted in the [BERT paper](https://arxiv.org/abs/1810.04805) in section A3: recommended learning rate 5e-5, 3e-5, 2e-5. – cronoik May 12 '21 at 21:03