
As I have noticed, many popular convolutional neural network architectures (e.g. AlexNet) use more than one fully connected layer, usually of nearly the same dimension, to gather the responses to the features detected by the earlier layers.

Why don't we use just one FC layer for that? Why might this hierarchical arrangement of fully connected layers be more useful?


Ali Sharifi B.
    I think the later steps should combine the previous ones in a non-linear manner. Of course Cybenko's theorem holds and tells us that a single hidden layer is capable enough, but as everywhere in deep learning, you trade greater network depth for a better chance of well-learned layers. There is a lot of work explaining why this should work better; a common example is the learnability of parity functions, where more layers simply work better. – sascha May 14 '16 at 13:20
    The convolutional layers extract features, and the fully connected layers then combine those features to model the outputs. The more fully connected layers, the more complex and powerful the network, but also the higher the risk of overfitting. Caution: one fully connected layer with 2N neurons does not model the same things as two layers with N neurons each. – FiReTiTi May 14 '16 at 17:50
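Picking up FiReTiTi's caution above, here is a minimal plain-Python sketch (with made-up sizes D and N, nothing taken from the comments) showing that the two arrangements are genuinely different models and don't even cost the same number of parameters:

```python
# Compare one hidden layer of 2N neurons against two stacked hidden
# layers of N neurons each, for an input of dimension D and one output.
# D and N are arbitrary illustrative sizes.

def dense_params(n_in, n_out):
    # weights plus biases of one fully connected layer
    return n_in * n_out + n_out

D, N = 512, 256

one_wide = dense_params(D, 2 * N) + dense_params(2 * N, 1)
two_deep = dense_params(D, N) + dense_params(N, N) + dense_params(N, 1)

print(f"one hidden layer of {2 * N} units: {one_wide:,} parameters")
print(f"two hidden layers of {N} units:  {two_deep:,} parameters")
# The deeper version here is actually cheaper, yet it computes f2(f1(x)),
# a composition of two non-linearities, which the single wide layer cannot.
```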

2 Answers


Because there are some functions, such as XOR, that can't be modeled by a single layer. In this type of architecture the convolutional layers compute local features, and the fully connected output layer(s) then combine these local features to derive the final outputs. So you can think of the fully connected layers as a semi-independent mapping from features to outputs, and if this mapping is complex you may need the expressive power of multiple layers.

SpinyNorman
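To make the XOR point in the answer above concrete, here is a minimal numpy sketch; the hidden width, seed, learning rate, and iteration count are arbitrary illustrative choices, not anything prescribed by the answer:

```python
# Logistic regression (no hidden layer) cannot separate XOR, but adding
# one small hidden layer solves it.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Two-layer network: 2 inputs -> 4 hidden units -> 1 output.
W1 = rng.normal(0.0, 1.0, (2, 4)); b1 = np.zeros((1, 4))
W2 = rng.normal(0.0, 1.0, (4, 1)); b2 = np.zeros((1, 1))

lr = 1.0
for _ in range(5000):
    h = sigmoid(X @ W1 + b1)              # hidden activations
    p = sigmoid(h @ W2 + b2)              # predicted probabilities
    d_out = (p - y) / len(X)              # grad of mean cross-entropy wrt output pre-activation
    d_hid = (d_out @ W2.T) * h * (1 - h)  # backprop through the hidden layer
    W2 -= lr * (h.T @ d_out); b2 -= lr * d_out.sum(0, keepdims=True)
    W1 -= lr * (X.T @ d_hid); b1 -= lr * d_hid.sum(0, keepdims=True)

p = sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2)
print(np.round(p.ravel(), 2))  # should approach [0, 1, 1, 0]
# With the hidden layer removed (a single layer straight from X), the best
# the model can do is p ~ 0.5 everywhere: XOR is not linearly separable.
```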

Actually it's no longer popular/normal. Networks from 2015 onward (such as ResNet and Inception-v4) use global average pooling (GAP) plus softmax as the last layer, which gives the same performance with a much smaller model: the last two layers of VGG16 hold about 80% of all the parameters in the network. But to answer your question: it is common to use a 2-layer MLP for classification and to consider the rest of the network as feature generation. One layer alone would be ordinary logistic regression, with a global minimum and simple properties; two layers add useful non-linearity while still being trainable with SGD.

yura
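For readers who want the arithmetic behind the VGG16 claim in this answer, here is a back-of-the-envelope sketch in plain Python; the layer dimensions are VGG16's published classifier sizes, and the GAP head is the single dense-layer design the answer describes:

```python
# VGG16's classifier head is fc(7*7*512 -> 4096), fc(4096 -> 4096),
# fc(4096 -> 1000). A GAP head instead averages each 7x7 feature map to
# a single number, leaving a 512-vector, then applies one dense layer
# followed by softmax.

def dense_params(n_in, n_out):
    return n_in * n_out + n_out  # weights + biases

fc_head = (dense_params(7 * 7 * 512, 4096)
           + dense_params(4096, 4096)
           + dense_params(4096, 1000))
gap_head = dense_params(512, 1000)  # GAP itself has no parameters

print(f"VGG16 FC head: {fc_head:,} parameters")   # ~123.6 million
print(f"GAP head:      {gap_head:,} parameters")  # ~0.5 million
```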