
While building a neural network model for a classification problem, as far as I know, we need an activation function at the last layer. The tutorial (https://www.tensorflow.org/tutorials/images/cnn?hl=tr) says "...then add one or more Dense layers on top. CIFAR has 10 output classes, so you use a final Dense layer with 10 outputs and a softmax activation." The model is:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import models, layers

model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10))

model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

history = model.fit(train_images, train_labels, epochs=10, 
                    validation_data=(test_images, test_labels))

So, my question is: where is the softmax activation function in this model? The same thing happens in (https://www.tensorflow.org/tutorials/images/classification?hl=tr): that is a binary classification problem, and there is no activation function at the last layer either. Besides,

  • with the model above, which methods can I use directly: model.predict_classes(), model.predict(), model.predict_proba()?

  • Why/when/in what situations would I prefer the structure above instead of a last layer with an activation="softmax" parameter?

Thanks.

iaai

1 Answer

The softmax activation goes in the last layer, but whether you actually put it there depends on your data, labels, loss function and optimizer. The only thing softmax adds is turning the raw outputs (logits) of the last layer into probability values, and that also answers your question about which prediction method to use. In your model the loss is SparseCategoricalCrossentropy(from_logits=True), which tells Keras that the model outputs raw logits and that the softmax should be applied inside the loss; that is also more numerically stable. For example, if you have multi-class classification with integer labels like [0, 1, 2], leaving out the softmax may make your training faster, but again it depends on your optimizer. I had a case where I had to use Adamax without softmax to get the best result.
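For comparison, here is a minimal sketch of the equivalent setup with softmax inside the model (model_softmax is just my name for it; the architecture is copied from your question). The key point is that from_logits must match whether the last layer already applies softmax:

import tensorflow as tf
from tensorflow.keras import layers, models

# Same CIFAR model as in the question, but with softmax baked into
# the last layer, so the model outputs probabilities instead of logits.
model_softmax = models.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dense(10, activation='softmax'),  # outputs probabilities directly
])
model_softmax.compile(
    optimizer='adam',
    # from_logits=False (the default) because the outputs are
    # already probabilities; the loss must not apply softmax again
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
    metrics=['accuracy'])

Without softmax, the model outputs raw logits. The snippet below (taken from a 3-class example, not the CIFAR model above) shows what they look like before and after tf.nn.softmax: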

predictions = model(features)  # `features` is a batch of inputs; output is raw logits
predictions[:5]

<tf.Tensor: shape=(5, 3), dtype=float32, numpy=
array([[-0.24120538, -6.219783  , -0.28271127],
       [-0.3727467 , -7.456743  , -0.2836677 ],
       [ 0.29496336, -5.0277514 , -1.2696607 ],
       [ 0.27997503, -4.6681266 , -1.181448  ],
       [-0.23913011, -5.8720627 , -0.23690107]], dtype=float32)>

tf.nn.softmax(predictions[:5])  # logits -> probabilities; each row sums to 1

<tf.Tensor: shape=(5, 3), dtype=float32, numpy=
array([[5.0971615e-01, 1.2908182e-03, 4.8899299e-01],
       [4.7755370e-01, 4.0038861e-04, 5.2204597e-01],
       [8.2369196e-01, 4.0191957e-03, 1.7228886e-01],
       [8.0710059e-01, 5.7278876e-03, 1.8717149e-01],
       [4.9855182e-01, 1.7838521e-03, 4.9966437e-01]], dtype=float32)>
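As for the prediction methods: with this model, model.predict() returns logits. Sequential.predict_classes() and Sequential.predict_proba() are legacy Sequential-only helpers that are deprecated (and removed in recent TensorFlow releases), so the safer pattern is to post-process the logits yourself. A minimal sketch, assuming the model and test_images from your question:

import numpy as np
import tensorflow as tf

logits = model.predict(test_images)     # shape (num_samples, 10), raw logits
probs = tf.nn.softmax(logits).numpy()   # probabilities per class
classes = np.argmax(logits, axis=-1)    # predicted class ids; argmax of logits
                                        # equals argmax of probs, since softmax
                                        # is monotonic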

hamid.khb
  • Thanks for the explanation. Here is another link about logits: https://stackoverflow.com/questions/34240703/what-is-logits-softmax-and-softmax-cross-entropy-with-logits – iaai Apr 28 '20 at 09:26