
I can't work out the correct number of parameters of AlexNet or VGG Net.

For example, to calculate the number of parameters of a conv3-256 layer of VGG Net, I compute 0.59M = (3*3)*(256*256), that is (kernel size) * (product of the number of channels in the two adjoining layers). However, adding things up that way, I can't reach the 138M parameters.

So could you please show me where my calculation goes wrong, or show me the right calculation procedure?

stop-cran
Eric
    Please give your full calculation of all layers so we can see what's wrong. Here's a starting point to see how to calculate the total number: http://learning.eng.cam.ac.uk/pub/Public/Turner/Teaching/ml-lecture-3-slides.pdf – runDOSrun Jan 30 '15 at 16:40
  • What does `S` stand for in the 10th slide of the lecture? `Stride` for the subsampling? @runDOSrun – nn0p Feb 05 '15 at 14:48
    Here is my answer to a similar question here: http://stackoverflow.com/a/39687866/1621562. You can write a simple python script to compute the total number of parameters (if you are using Caffe). See https://gist.github.com/kaushikpavani/a6a32bd87fdfe5529f0e908ed743f779 – Kaushik Pavani Sep 25 '16 at 14:24

5 Answers


If you refer to VGG Net with 16 layers (table 1, column D) then 138M refers to the total number of parameters of this network, i.e. including all convolutional layers, but also the fully-connected ones.

Looking at the 3rd convolutional stage composed of 3 x conv3-256 layers:

  • the first one has N=128 input planes and F=256 output planes,
  • the two other ones have N=256 input planes and F=256 output planes.

The convolution kernel is 3x3 for each of these layers. In terms of parameters this gives:

  • 128x3x3x256 (weights) + 256 (biases) = 295,168 parameters for the 1st one,
  • 256x3x3x256 (weights) + 256 (biases) = 590,080 parameters for the two other ones.

As explained above, you have to do that for all layers, including the fully-connected ones, and sum these values to obtain the final 138M number.


UPDATE: the breakdown among layers gives:

conv3-64  x 2       : 38,720
conv3-128 x 2       : 221,440
conv3-256 x 3       : 1,475,328
conv3-512 x 3       : 5,899,776
conv3-512 x 3       : 7,079,424
fc1                 : 102,764,544
fc2                 : 16,781,312
fc3                 : 4,097,000
TOTAL               : 138,357,544

In particular for the fully-connected layers (fc):

 fc1 (x): (512x7x7)x4,096 (weights) + 4,096 (biases)
 fc2    : 4,096x4,096     (weights) + 4,096 (biases)
 fc3    : 4,096x1,000     (weights) + 1,000 (biases)

(x) see section 3.2 of the article: the fully-connected layers are first converted to convolutional layers (the first FC layer to a 7 × 7 conv. layer, the last two FC layers to 1 × 1 conv. layers).

Details about fc1

As noted above, the spatial resolution right before feeding the fully-connected layers is 7x7 pixels. This is because this VGG Net uses spatial padding before convolutions, as detailed in section 2.1 of the paper:

[...] the spatial padding of conv. layer input is such that the spatial resolution is preserved after convolution, i.e. the padding is 1 pixel for 3×3 conv. layers.

With such a padding, and working with a 224x224-pixel input image, the resolution decreases as follows along the layers: 112x112, 56x56, 28x28, 14x14, and 7x7 after the last convolution/pooling stage, which has 512 feature maps.

This gives a feature vector passed to fc1 with dimension: 512x7x7.
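
For reference, here is a small Python sketch (plain Python, no framework; the helper names conv_params and fc_params are mine) that applies the weights + biases formulas above to every layer of configuration D and reproduces the total:

def conv_params(c_in, c_out, k=3):
    # 3x3 convolution: k*k*c_in*c_out weights plus c_out biases
    return k * k * c_in * c_out + c_out

def fc_params(n_in, n_out):
    # fully-connected layer: n_in*n_out weights plus n_out biases
    return n_in * n_out + n_out

# (input channels, output channels) of the 13 conv layers, in order
conv_layers = [(3, 64), (64, 64),
               (64, 128), (128, 128),
               (128, 256), (256, 256), (256, 256),
               (256, 512), (512, 512), (512, 512),
               (512, 512), (512, 512), (512, 512)]

# fc1 sees the 7x7x512 volume left after the last pooling stage
fc_layers = [(512 * 7 * 7, 4096), (4096, 4096), (4096, 1000)]

total = sum(conv_params(i, o) for i, o in conv_layers) + \
        sum(fc_params(i, o) for i, o in fc_layers)
print(total)  # 138357544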

deltheil
  • I ran the numbers for fun using the same assumption but I don't get 138M either. I suppose I have a typo somewhere since it's a rather long expression. – runDOSrun Jan 30 '15 at 19:28
    I got 14,714,688 parameters for the convolutional layers and 123,642,856 for the fully-connected which gives 138M in total. – deltheil Jan 30 '15 at 22:01
  • The `fc1` padding thing is exactly where I got stuck. Thank you for the very detailed explanation. – Eric Feb 01 '15 at 11:04
  • In today's architectures, should we include batch normalization/scale layer parameters as well? In Caffe the BN layer is split into two layers, BN and scale, which I guess actually doubles the number of parameters. So do layers like BN (Caffe style) and dropout need to be included in this calculation as well? – Rika Jun 05 '16 at 05:14
    @deltheil: How can one calculate the number of operations (mul /sum) in an architecture such as this one? – Rika Mar 16 '17 at 10:37

A great breakdown of the calculation for the VGG-16 network is also given in the CS231n lecture notes.

INPUT:     [224x224x3]    memory:  224*224*3=150K   weights: 0
CONV3-64:  [224x224x64]   memory:  224*224*64=3.2M  weights: (3*3*3)*64 = 1,728
CONV3-64:  [224x224x64]   memory:  224*224*64=3.2M  weights: (3*3*64)*64 = 36,864
POOL2:     [112x112x64]   memory:  112*112*64=800K  weights: 0
CONV3-128: [112x112x128]  memory:  112*112*128=1.6M weights: (3*3*64)*128 = 73,728
CONV3-128: [112x112x128]  memory:  112*112*128=1.6M weights: (3*3*128)*128 = 147,456
POOL2:     [56x56x128]    memory:  56*56*128=400K   weights: 0
CONV3-256: [56x56x256]    memory:  56*56*256=800K   weights: (3*3*128)*256 = 294,912
CONV3-256: [56x56x256]    memory:  56*56*256=800K   weights: (3*3*256)*256 = 589,824
CONV3-256: [56x56x256]    memory:  56*56*256=800K   weights: (3*3*256)*256 = 589,824
POOL2:     [28x28x256]    memory:  28*28*256=200K   weights: 0
CONV3-512: [28x28x512]    memory:  28*28*512=400K   weights: (3*3*256)*512 = 1,179,648
CONV3-512: [28x28x512]    memory:  28*28*512=400K   weights: (3*3*512)*512 = 2,359,296
CONV3-512: [28x28x512]    memory:  28*28*512=400K   weights: (3*3*512)*512 = 2,359,296
POOL2:     [14x14x512]    memory:  14*14*512=100K   weights: 0
CONV3-512: [14x14x512]    memory:  14*14*512=100K   weights: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512]    memory:  14*14*512=100K   weights: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512]    memory:  14*14*512=100K   weights: (3*3*512)*512 = 2,359,296
POOL2:     [7x7x512]      memory:  7*7*512=25K      weights: 0
FC:        [1x1x4096]     memory:  4096             weights: 7*7*512*4096 = 102,760,448
FC:        [1x1x4096]     memory:  4096             weights: 4096*4096 = 16,777,216
FC:        [1x1x1000]     memory:  1000             weights: 4096*1000 = 4,096,000

TOTAL memory: 24M * 4 bytes ~= 93MB / image (only forward! ~*2 for bwd)
TOTAL params: 138M parameters
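
If you happen to have PyTorch/torchvision installed, you can also cross-check this total against its stock VGG-16 implementation (a hedged aside: this assumes torchvision's model, which, unlike the weights column above, also counts the conv and FC biases):

import torchvision

model = torchvision.models.vgg16()  # randomly initialized; the weight values themselves don't matter here
print(sum(p.numel() for p in model.parameters()))  # 138357544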
Piotr Dabkowski
Ray

The VGG-16 architecture below is the one from the original paper, as highlighted by @deltheil in (table 1, column D), and I quote from there:

2.1 ARCHITECTURE

During training, the input to our ConvNets is a fixed-size 224 × 224 RGB image. The only preprocessing we do is subtracting the mean RGB value, computed on the training set, from each pixel.

The image is passed through a stack of convolutional (conv.) layers, where we use filters with a very small receptive field: 3 × 3 (which is the smallest size to capture the notion of left/right, up/down, center). The convolution stride is fixed to 1 pixel; the spatial padding of conv. layer input is such that the spatial resolution is preserved after convolution, i.e. the padding is 1 pixel for 3 × 3 conv. layers. Spatial pooling is carried out by five max-pooling layers, which follow some of the conv. layers (not all the conv. layers are followed by max-pooling). Max-pooling is performed over a 2 × 2 pixel window, with stride 2.

A stack of convolutional layers (which has a different depth in different architectures) is followed by three Fully-Connected (FC) layers: the first two have 4096 channels each, the third performs 1000-way ILSVRC classification and thus contains 1000 channels (one for each class).

The final layer is the soft-max layer.

Using the above, and

  • A formula to find the activation shape of a layer: output size = floor((n + 2p - f) / s) + 1 (the symbols are defined below), and

  • A formula to calculate the weights corresponding to every layer: #weights = (f*f*Cin + 1) * Cout, where the +1 accounts for the bias:

Note:

  • you can simply multiply the entries of the respective activation-shape column to get the activation size

  • CONV3: means a 3*3 filter is convolved over the input!

  • MAXPOOL3-2: means the 3rd pooling layer, with a 2*2 filter, stride=2, padding=0 (pretty standard in pooling layers)

  • Stage-3: means it has multiple CONV layers stacked, with the same padding=1, stride=1, and 3*3 filter

  • Cin: means the depth, a.k.a. number of channels, coming from the input layer!

  • Cout: means the outgoing depth, a.k.a. number of channels (you configure it differently to learn more complex features!)

Cout is the number of filters that you stack together to learn multiple features at different scales; for example, in the first layer you might want to learn vertical edges, horizontal edges and edges at, say, 45 degrees, so 64 different filters, each for a different kind of edge!

  • n: the input dimension without depth, e.g. n=224 for the INPUT image!

  • p: the padding used for each layer

  • s: the stride used for each layer

  • f: the filter size, i.e. 3*3 for CONV and 2*2 for MAXPOOL layers!

  • After MAXPOOL5-2, you simply flatten the volume and feed it to the first FC layer.

We get the table below (shown as an image in the original answer), listing for each layer its activation shape, activation size, and number of weights.

Finally, if you add all the weights calculated in the last column, you end up with 138,357,544 (138 million) parameters to train for VGG-16!
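
As a double check of the activation shapes (a small sketch in plain Python; the function and variable names are mine), one can walk the conv/pool stack with the output-size formula floor((n + 2p - f) / s) + 1:

def out_size(n, f, p, s):
    # spatial output size of a conv or pooling layer
    return (n + 2 * p - f) // s + 1

n = 224  # input resolution
# each stage: 3x3 convs with p=1, s=1 (resolution preserved), then a 2x2 max-pool with stride 2
for c_out in [64, 128, 256, 512, 512]:
    n = out_size(n, f=3, p=1, s=1)   # conv layers keep the resolution
    n = out_size(n, f=2, p=0, s=2)   # max-pooling halves it
    print(f"after the {c_out}-map stage: {n}x{n}")

print("flattened volume fed to the first FC layer:", n * n * 512)  # 7*7*512 = 25088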

Anu

Here is how to compute the number of parameters in each CNN layer.

Some definitions:

n -- width of the filter
m -- height of the filter
k -- number of input feature maps
L -- number of output feature maps

Then the number of parameters is #params = (n*m*k + 1)*L, in which the first term comes from the weights and the +1 from the bias.
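
For instance, plugging the first conv3-256 layer of VGG-16 into this formula (a quick sketch; the layer sizes are taken from the accepted answer above):

n, m, k, L = 3, 3, 128, 256   # 3x3 filter, 128 input maps, 256 output maps
print((n * m * k + 1) * L)    # 295168, matching the accepted answer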

saunter

I know this is an old post; nevertheless, I think the accepted answer by @deltheil contains a mistake. If not, I would be happy to be corrected. The convolution layers should not have biases, i.e. 128x3x3x256 (weights) + 256 (biases) = 295,168 should instead be 128x3x3x256 (weights) = 294,912.

Thanks

rav
  • You should rather add this as a comment to the answer. – Sahil Mittal Jul 05 '17 at 11:51
  • Yes, I wanted to but couldn't due to reputation. – rav Jul 05 '17 at 12:14
    Correct me if I am wrong, but by default Caffe uses biases for convolutional layers[1] (bias_term [default true]) and the official VGG 16 pre-trained model[2] uses such default (there is no bias_term: false in the layers definition). [1]: http://caffe.berkeleyvision.org/tutorial/layers/convolution.html [2]: https://gist.githubusercontent.com/ksimonyan/211839e770f7b538e2d8/raw/0067c9b32f60362c74f4c445a080beed06b07eb3/VGG_ILSVRC_16_layers_deploy.prototxt – deltheil Oct 14 '17 at 12:06