I am a little confused about the mathematics of computing the output of a second convolutional layer. The output of my first convolutional layer has shape (11,11,64), and the second convolutional layer has 64 filters of size 3x3, stride 1, and padding 'same'. The model summary shows that the kernel of the second convolutional layer has shape (3,3,64,64), but the output shape of the second convolutional layer is (11,11,64). So I am confused about how to get (11,11,64). I searched the internet, and the explanation I found says that the convolution results in an 11x11x1 shape because of stacking, and for 64 images it will be 11,11,64. So what is the mathematics behind getting the shape 11x11x1? I would have expected the result to have shape 11,11,64,64. Please help me understand, since I need to code this algorithm for hardware.
CodePudding user response:
You start with 64 images (more precisely, 1 "image" with 64 channels, but we'll stick with 64 images for simplicity), each has a size of 11 * 11:
I1 :: 11 * 11
I2 :: 11 * 11
...
I64 :: 11 * 11
Then we have the convolution kernel. Assume for now that the kernel shape is 1 * 64 * 3 * 3; then for each input image (again, technically a "channel"), there is a corresponding 3 * 3 kernel:
K1 :: 3 * 3
...
K64 :: 3 * 3
Then we calculate the convolution between I1 and K1, I2 and K2, ..., I64 and K64. Now it looks like we have sixty-four 11 * 11 results (the 'same' padding keeps each result at 11 * 11), but actually, we ADD them together into a single one: O1 = K1 * I1 + K2 * I2 + ... + K64 * I64, where * means convolution. That is where the 1 * 11 * 11 comes from.
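The summation over channels can be sketched in NumPy as below. This is only an illustration under stated assumptions: the helper name conv2d_same_single is made up for this example, the data is random, zero padding of one pixel stands in for 'same', and the kernel is not flipped (as far as I know, deep learning frameworks compute cross-correlation and still call it convolution).

```python
import numpy as np

def conv2d_same_single(image, kernel):
    """Slide one 2-D kernel over one 2-D image (stride 1, zero 'same' padding).

    Hypothetical helper for illustration; assumes an odd kernel size and
    does not flip the kernel (cross-correlation).
    """
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(image, ((ph, ph), (pw, pw)))  # zeros around the border
    out = np.zeros_like(image, dtype=float)
    for y in range(image.shape[0]):
        for x in range(image.shape[1]):
            # elementwise product of the 3x3 window and the kernel, then sum
            out[y, x] = np.sum(padded[y:y + kh, x:x + kw] * kernel)
    return out

rng = np.random.default_rng(0)
images = rng.standard_normal((64, 11, 11))   # I1 ... I64 (random stand-ins)
kernels = rng.standard_normal((64, 3, 3))    # K1 ... K64 (random stand-ins)

# O1 = K1 * I1 + K2 * I2 + ... + K64 * I64
o1 = sum(conv2d_same_single(images[c], kernels[c]) for c in range(64))
print(o1.shape)  # (11, 11)
```

The sixty-four intermediate 11 * 11 maps never need to be stored; on hardware they can be accumulated into one 11 * 11 buffer as they are produced.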
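Finally, since the actual kernel shape is 64 * 64 * 3 * 3 (the same numbers as the (3,3,64,64) in your model summary, just in a different order), there are 64 such sets of kernels, one set per output channel, and the output has the shape 64 * 11 * 11: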
O1 = K1_1 * I1 + ... + K64_1 * I64
O2 = K1_2 * I1 + ... + K64_2 * I64
...
O64 = K1_64 * I1 + ... + K64_64 * I64
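As a rough sketch of the whole layer in this channel-first notation, assuming the kernels are indexed as kernels[output channel, input channel], random data, stride 1, zero padding, and no bias term (conv_layer_same is a hypothetical name used only here):

```python
import numpy as np

def conv_layer_same(images, kernels):
    """images: (C_in, H, W); kernels: (C_out, C_in, kh, kw).

    Illustrative only: stride 1, zero 'same' padding, odd kernel size,
    no bias, kernel not flipped.
    """
    c_out, c_in, kh, kw = kernels.shape
    _, h, w = images.shape
    padded = np.pad(images, ((0, 0), (kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros((c_out, h, w))
    for o in range(c_out):            # one output map per set of 64 kernels
        for y in range(h):
            for x in range(w):
                # the window covers all 64 input channels at once: (64, 3, 3);
                # summing the product adds up the 64 per-channel results
                out[o, y, x] = np.sum(padded[:, y:y + kh, x:x + kw] * kernels[o])
    return out

rng = np.random.default_rng(0)
images = rng.standard_normal((64, 11, 11))       # I1 ... I64
kernels = rng.standard_normal((64, 64, 3, 3))    # kernels[o, c] plays the role of K(c+1)_(o+1)
outputs = conv_layer_same(images, kernels)
print(outputs.shape)  # (64, 11, 11)
```

The second 64 in the kernel shape is consumed by the sum over input channels, which is why the output is 64 * 11 * 11 and not 64 * 64 * 11 * 11.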
I hope it makes things somewhat clearer. Coincidentally, I am doing some coding on hardware as well, and I was learning those last month.
CodePudding user response:
This might help you:
input_layer2.shape == (11, 11, 64)
kernel_layer2.shape == (3, 3, 64, 64)
input_layer2[:3, :3].shape == (3, 3, 64)
kernel_layer2[:,:,:,0].shape == (3, 3, 64)
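This is for a single output position only. With 'same' padding the input is zero-padded by one pixel on each side, so the window input_layer2[:3, :3] is centred on output_layer2[1, 1]. Multiply the window and the filter element by element (np.dot does not work on these two shapes) and sum everything:
for i in range(64):
    output_layer2[1, 1, i] = np.sum(input_layer2[:3, :3] * kernel_layer2[:, :, :, i])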
Finally, sliding the window over every position (stride 1):
output_layer2.shape == (11, 11, 64)
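Putting the pieces together, here is a minimal sketch of the full loop in the channels-last layout used above. It assumes random data, stride 1, an odd kernel size, zero padding, and no bias term; conv2d_same_hwc is a name invented for this example.

```python
import numpy as np

def conv2d_same_hwc(x, w):
    """x: (H, W, C_in); w: (kh, kw, C_in, C_out); stride 1, zero 'same' padding.

    Illustrative sketch: odd kernel sizes only, no bias, kernel not flipped.
    """
    h, wd, c_in = x.shape
    kh, kw, _, c_out = w.shape
    ph, pw = kh // 2, kw // 2                      # 1 pixel for a 3x3 kernel
    padded = np.pad(x, ((ph, ph), (pw, pw), (0, 0)))
    # output size = (H + 2*pad - k) // stride + 1 = (11 + 2 - 3) // 1 + 1 = 11
    out_h = (h + 2 * ph - kh) // 1 + 1
    out_w = (wd + 2 * pw - kw) // 1 + 1
    out = np.zeros((out_h, out_w, c_out))
    for y in range(out_h):
        for xx in range(out_w):
            window = padded[y:y + kh, xx:xx + kw, :]          # (3, 3, 64)
            for i in range(c_out):
                # 3 * 3 * 64 = 576 multiply-accumulates per output value
                out[y, xx, i] = np.sum(window * w[:, :, :, i])
    return out

rng = np.random.default_rng(0)
input_layer2 = rng.standard_normal((11, 11, 64))
kernel_layer2 = rng.standard_normal((3, 3, 64, 64))
output_layer2 = conv2d_same_hwc(input_layer2, kernel_layer2)
print(output_layer2.shape)  # (11, 11, 64)
```

For a hardware implementation, each np.sum expands into three nested loops over kernel row, kernel column, and input channel, feeding a single accumulator per output value.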