17

In a CNN for binary classification of images, should the shape of the output be (number of images, 1) or (number of images, 2)? Specifically, here are two possible last layers in a CNN:

keras.layers.Dense(2, activation = 'softmax')(previousLayer)

or

keras.layers.Dense(1, activation = 'softmax')(previousLayer)

In the first case, every image has 2 output values (the probability of belonging to group 1 and the probability of belonging to group 2). In the second case, each image has only 1 output value, which is its label (0 or 1; label = 1 means it belongs to group 1).

Which one is correct? Is there an intrinsic difference? I don't want to recognize any object in those images, just divide them into 2 groups.

Thanks a lot!

BuboBubo
  • The second code snippet only produces the constant value 1.0; you can't use softmax with a single neuron. – Dr. Snoopy Jun 12 '18 at 06:01
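
That behavior is easy to verify: softmax normalizes across the output units, so with a single unit the result is always exp(x)/exp(x) = 1. A minimal check (a sketch assuming TensorFlow 2.x as the Keras backend):

import tensorflow as tf

# One logit per image, shape (3, 1). Softmax normalizes over the last axis,
# so with a single unit every output collapses to exp(x) / exp(x) = 1.
logits = tf.constant([[-3.0], [0.0], [5.0]])
probs = tf.nn.softmax(logits, axis=-1)
print(probs.numpy())  # [[1.] [1.] [1.]] regardless of the input values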

2 Answers

7

The first one is the correct solution:

keras.layers.Dense(2, activation = 'softmax')(previousLayer)

Usually, we use the softmax activation function for classification tasks, and the output width equals the number of categories. This means that if you want to classify one object into three categories with the labels A, B, or C, you need the Dense layer to generate an output with a shape of (None, 3). Then you can use the cross-entropy loss function to calculate the loss, automatically compute the gradients, and run the back-propagation process.
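
As a minimal sketch of that setup (the convolutional layers and all sizes here are illustrative placeholders, not taken from the question):

from tensorflow import keras

# Toy CNN with a three-way softmax head for labels A, B, C.
inputs = keras.Input(shape=(64, 64, 3))
x = keras.layers.Conv2D(16, 3, activation='relu')(inputs)
x = keras.layers.GlobalAveragePooling2D()(x)
outputs = keras.layers.Dense(3, activation='softmax')(x)  # output shape (None, 3)

model = keras.Model(inputs, outputs)
# Integer labels pair with sparse_categorical_crossentropy;
# one-hot labels would pair with categorical_crossentropy instead.
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])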

If you only generate one value with the Dense layer, you get a tensor with a shape of (None, 1), i.e. a single numeric value, as in a regression task. You are then using the value of the output to represent the category. This can still be correct, but it does not behave like the general solution to a classification task.
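
Note that a single-output head can still be a proper binary classifier if you swap softmax for sigmoid, as discussed in the comments below. A sketch under that assumption (all sizes and names are placeholders):

from tensorflow import keras

# Single-unit binary head: sigmoid (not softmax) outputs P(label = 1).
inputs = keras.Input(shape=(64, 64, 3))
x = keras.layers.Conv2D(16, 3, activation='relu')(inputs)
x = keras.layers.GlobalAveragePooling2D()(x)
outputs = keras.layers.Dense(1, activation='sigmoid')(x)  # output shape (None, 1)

model = keras.Model(inputs, outputs)
model.compile(optimizer='adam',
              loss='binary_crossentropy',  # targets are plain 0/1 labels
              metrics=['accuracy'])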

Ember Xu
  • Do you know why the documentation (https://keras.io/getting-started/sequential-model-guide/) suggests the opposite way (only one output)? For me it works the same whether I use a final dense layer with one output dimension and then `binary_crossentropy`, or a final dense layer with two output dimensions and `sparse_categorical_crossentropy`. – KLaz Oct 24 '18 at 15:48
  • @KLaz Actually, when we do a classification task, we choose the loss function based on the number of categories. In Keras, for a **two-category classification**, we usually use `Dense(1, activation='sigmoid', name='output')` as the last node and compile the model with the `binary_crossentropy` loss function. But for a multi-class classification task, we choose something like `Dense(4, activation='softmax', name='output')` as the output node and, correspondingly, `categorical_crossentropy` as the loss function. – Ember Xu Nov 15 '18 at 06:42
  • @KLaz I think it may just be a matter of convention, because both approaches generate the same result. – Ember Xu Nov 15 '18 at 06:42
  • @NeoXu, I have a 2-class disease classification task (X and non-X). I used 2 units in the last dense layer with a `sigmoid` activation afterwards. For the loss I used `mean_squared_error`. Before that, the train and test labels were converted with `to_categorical`. Does that look right? Or should I not convert with `to_categorical` at all, and instead use 1 unit in the dense layer, `sigmoid` afterwards, and `binary_crossentropy` as the loss? – bit_scientist Dec 06 '19 at 09:55
7

The difference is whether the class probabilities are independent of each other (multi-label classification) or not.

When there are 2 classes and the class probabilities sum to one, i.e. P(c=1) + P(c=0) = 1, then

keras.layers.Dense(2, activation = 'softmax') 

keras.layers.Dense(1, activation = 'sigmoid')

both are correct in terms of class probabilities. The only difference is how you supply the labels during training. But

keras.layers.Dense(2, activation = 'sigmoid')

is incorrect in that context. However, it is a correct implementation if P(c=1) + P(c=0) != 1. This is the case for multi-label classification, where an instance may belong to more than one correct class.
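
To make the multi-label case concrete, here is a sketch (all sizes, names, and data are illustrative): two independent sigmoid units trained with binary_crossentropy against multi-hot targets.

import numpy as np
from tensorflow import keras

# Two independent sigmoid units: each output is its own Bernoulli probability,
# so the two values need not sum to 1.
inputs = keras.Input(shape=(128,))
outputs = keras.layers.Dense(2, activation='sigmoid')(inputs)

model = keras.Model(inputs, outputs)
model.compile(optimizer='adam', loss='binary_crossentropy')

# Multi-hot targets: the last instance belongs to both classes at once.
x = np.random.rand(3, 128).astype('float32')
y = np.array([[1, 0], [0, 1], [1, 1]], dtype='float32')
model.fit(x, y, epochs=1, verbose=0)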

rajesh