29

The Situation:

I am wondering how to use TensorFlow optimally when my training data is imbalanced in label distribution between 2 labels. For instance, suppose the MNIST tutorial is simplified to only distinguish between 1's and 0's, where all images available to us are either 1's or 0's. This is straightforward to train using the provided TensorFlow tutorials when we have roughly 50% of each type of image to train and test on. But what about the case where 90% of the images available in our data are 0's and only 10% are 1's? I observe that in this case, TensorFlow routinely predicts my entire test set to be 0's, achieving an accuracy of a meaningless 90%.

One strategy I have used with some success is to pick random batches for training that do have an even distribution of 0's and 1's. This approach ensures that I can still use all of my training data, and it produced decent results: accuracy below 90%, but a much more useful classifier. Since accuracy is somewhat useless to me in this case, my metric of choice is typically area under the ROC curve (AUROC), and this produces a result respectably higher than 0.50.
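To make this concrete, here is a minimal sketch of what I mean by balanced batches (assuming X_train and y_train are numpy arrays with 0/1 labels; the function name is just illustrative):

import numpy as np

def balanced_batch(X_train, y_train, batch_size):
    # Draw half the batch from each class; classes are sampled with
    # replacement so the minority class can fill its half of every batch.
    pos_idx = np.where(y_train == 1)[0]
    neg_idx = np.where(y_train == 0)[0]
    half = batch_size // 2
    idx = np.concatenate([np.random.choice(pos_idx, half, replace=True),
                          np.random.choice(neg_idx, half, replace=True)])
    np.random.shuffle(idx)
    return X_train[idx], y_train[idx]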

Questions:

(1) Is the strategy I have described an accepted or optimal way of training on imbalanced data, or is there one that might work better?

(2) Since the accuracy metric is not as useful in the case of imbalanced data, is there another metric that can be maximized by altering the cost function? I can certainly calculate AUROC post-training, but can I train in such a way as to maximize AUROC?

(3) Is there some other alteration I can make to my cost function to improve my results for imbalanced data? Currently, I am using a default suggestion given in TensorFlow tutorials:

cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=pred, labels=y))
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost)

I have heard this may be possible by up-weighting the cost of miscategorizing the smaller label class, but I am unsure of how to do this.

MJoseph
  • Have you solved your problem? I have a similar problem and am currently experimenting with a) 50% dropout at hidden1, b) L2 regularization on the loss, and c) removing the most prominent class (the 90%) and training only on the remaining, evenly distributed 10%. – Frank Apr 22 '16 at 15:47
  • I never did find a better solution than taking random balanced batches. For practicality, I ended up abandoning neural nets altogether in favor of tree-based methods implemented in scikit-learn, where there are built-in cross-validation methods that can optimize on AUROC, which solves the imbalance problem beautifully. They also run much faster than TensorFlow since I have plenty of CPUs but no GPU. – MJoseph Apr 23 '16 at 17:03
  • https://www.tensorflow.org/tutorials/structured_data/imbalanced_data – Amin habibi Aug 04 '20 at 12:20

4 Answers

8

(1) It's fine to use your strategy. I work with imbalanced data as well; I usually try down-sampling and up-sampling first to make the training set evenly distributed, or use an ensemble method, training each classifier on an evenly distributed subset (a sketch of the ensemble idea is below).
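A rough sketch of that ensemble idea, assuming numpy arrays and illustrative train_model / predict_proba helpers (not from any particular library): train each member on all positives plus a different random subsample of negatives, then average the predictions.

import numpy as np

pos_idx = np.where(y_train == 1)[0]
neg_idx = np.where(y_train == 0)[0]
models = []
for _ in range(5):
    # each member sees all positives and an equally sized random slice of negatives
    sub_neg = np.random.choice(neg_idx, size=len(pos_idx), replace=False)
    idx = np.concatenate([pos_idx, sub_neg])
    models.append(train_model(X_train[idx], y_train[idx]))
# average the members' predicted probabilities on the test set
scores = np.mean([m.predict_proba(X_test) for m in models], axis=0)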

(2) I haven't seen any method that maximizes AUROC directly. My thinking is that AUROC is based on the true positive and false positive rates, which don't tell you how well the model does on each individual instance, so maximizing it would not necessarily maximize the model's ability to separate the classes.

(3) Regarding weighting the cost by the ratio of class instances, this is similar to the question Loss function for class imbalanced binary classifier in TensorFlow and its answer; a rough sketch follows.
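A sketch of that per-class weighting (TF 1.x style to match the question's code; pred and y are the question's logits and one-hot labels, and the 0.9/0.1 split is the imbalance described there):

import tensorflow as tf

class_weights = tf.constant([1.0 / 0.9, 1.0 / 0.1])  # inverse class frequencies: [weight for 0s, weight for 1s]
per_example_loss = tf.nn.softmax_cross_entropy_with_logits(logits=pred, labels=y)
example_weights = tf.reduce_sum(class_weights * y, axis=1)  # weight of each example's true class
cost = tf.reduce_mean(example_weights * per_example_loss)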

Uni
Young
5

Regarding imbalanced datasets, the first two methods that come to mind are upweighting positive samples and sampling to achieve balanced batch distributions.

Upweighting positive samples: This refers to increasing the loss on misclassified positive samples when training on datasets that have far fewer positive samples, which incentivizes the ML algorithm to learn parameters that do better on the positive class. For binary classification, there is a simple API in TensorFlow that achieves this: see weighted_cross_entropy, sketched below.
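A minimal sketch of that API (in TF 1.x the keyword is targets=, newer releases use labels=; logits and labels are assumed names for your model output and 0/1 targets, and a pos_weight of about 9 reflects the 90/10 split in the question):

import tensorflow as tf

# sigmoid cross entropy in which the loss on positive examples is multiplied by pos_weight
loss = tf.nn.weighted_cross_entropy_with_logits(targets=labels, logits=logits, pos_weight=9.0)
cost = tf.reduce_mean(loss)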

Batch sampling: This involves sampling the dataset so that each batch of training data has an even distribution of positive and negative samples. This can be done using the rejection sampling API provided by TensorFlow, sketched below.
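A sketch using tf.data.experimental.rejection_resample (the module path has moved between TensorFlow versions; dataset is assumed to yield (features, label) elements, and the 0.9/0.1 initial distribution is the question's imbalance):

import tensorflow as tf

resampler = tf.data.experimental.rejection_resample(
    class_func=lambda features, label: label,  # which class each element belongs to
    target_dist=[0.5, 0.5],                    # emit a roughly balanced stream
    initial_dist=[0.9, 0.1])                   # the current class distribution
balanced_ds = dataset.apply(resampler)
# rejection_resample emits (class, element) pairs; drop the extra class value
balanced_ds = balanced_ds.map(lambda extra_label, data: data)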

Convergii
4

I'm someone who has also struggled with imbalanced data. My strategies for countering it are below.

1) Use a cost function that accounts for the 0 and 1 labels at the same time, like the one below.

# binary cross-entropy over sigmoid outputs _pred; both the 0 and 1 terms contribute for every example
cost = tf.reduce_mean(-tf.reduce_sum(y*tf.log(_pred) + (1-y)*tf.log(1-_pred), reduction_indices=1))

2) Use SMOTE, an oversampling method that makes the numbers of 0 and 1 labels similar. Refer to http://comments.gmane.org/gmane.comp.python.scikit-learn/5278
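One commonly used SMOTE implementation lives in the third-party imbalanced-learn package (not part of TensorFlow or scikit-learn itself); a sketch, assuming X and y are numpy arrays:

from imblearn.over_sampling import SMOTE

# synthesize new minority-class examples until the label counts are balanced
X_resampled, y_resampled = SMOTE(random_state=0).fit_resample(X, y)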

Both strategies worked when I built a credit rating model.

Logistic regression is a typical method for handling imbalanced data in binary classification, such as predicting default rates, and AUROC is one of the best metrics for evaluating on imbalanced data.

2

1) Yes. This is a well-received strategy for countering imbalanced data, but in neural nets it only works well if you are using SGD.

Another easy way to balance the training data is to use weighted examples: amplify the per-instance loss by a larger or smaller weight depending on whether the example belongs to the rare or the common class. If you use online gradient descent, it can be as simple as using a larger or smaller learning rate when you see such examples, as in the sketch below.
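A toy sketch of that idea for online gradient descent on a logistic model (pure numpy; the names and the pos_weight of 9 for the rare class are just illustrative):

import numpy as np

def sgd_step(w, x, y_true, lr=0.01, pos_weight=9.0):
    p = 1.0 / (1.0 + np.exp(-np.dot(w, x)))      # predicted P(y = 1)
    weight = pos_weight if y_true == 1 else 1.0  # amplify the rare class
    grad = (p - y_true) * x                      # gradient of the per-instance log loss
    return w - lr * weight * grad                # equivalent to a larger learning rate for rare examples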

Not sure about 2.

Farseer