I'm reading a book where a section introduces how kernels work in CNNs: https://freecontent.manning.com/deep-learning-for-image-like-data/.
Sliding a kernel over an image, while requiring that at each position the whole kernel lies completely within the image, yields an activation map with reduced dimensions. For example, with a 3 x 3 kernel, one pixel is knocked off on all sides in the resulting activation map; with a 5 x 5 kernel, even two pixels.
What does it mean here to have one or two pixels knocked off?
CodePudding user response:
They mean that, without extra padding, a 3x3 kernel will "lose" one pixel per side in the output. So if your input image is NxN, the output will be (N-2)x(N-2).
For example, with N=5 you can see that when the kernel "fits" into the lower right corner, its center is one pixel off the corner in both the horizontal and vertical axes:
a a a a a        . . . . .
a a a a a        . b b b .
a a x x x  ===>  . b b b .
a a x X x        . b b B .
a a x x x        . . . . .

  5 x 5            3 x 3
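A minimal NumPy sketch of this "valid" sliding (function and variable names are mine, not from the book): the kernel is only applied where it fits entirely inside the image, so a 5x5 input with a 3x3 kernel gives a 3x3 output.

```python
import numpy as np

def conv2d_valid(image, kernel):
    # "Valid" convolution: the kernel must fit entirely inside the image,
    # so an NxN image with a kxk kernel yields an (N-k+1)x(N-k+1) output.
    n = image.shape[0]
    k = kernel.shape[0]
    out = np.empty((n - k + 1, n - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Multiply the kxk window by the kernel and sum.
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return out

image = np.ones((5, 5))   # the 5x5 "a" grid from the diagram
kernel = np.ones((3, 3))  # the 3x3 "x" kernel
print(conv2d_valid(image, kernel).shape)  # (3, 3): one pixel lost per side
```

With a 5x5 kernel on the same image, the output would shrink to 1x1, i.e. two pixels lost per side, as the book says.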
To avoid this, various padding strategies are used, e.g. "surrounding your picture" with 0s so that the output size is preserved:
0 0 0 0 0 0 0        . . . . . . .
0 a a a a a 0        . b b b b b .
0 a a a a a 0        . b b b b b .
0 a a a a a 0  ===>  . b b b b b .
0 a a a x x x        . b b b b b .
0 a a a x X x        . b b b b B .
0 0 0 0 x x x        . . . . . . .

  5 x 5 pad            5 x 5
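The same idea in NumPy (again just a sketch with my own names): pad the 5x5 image with one ring of zeros to get 7x7, then slide the 3x3 kernel over every position where it fits, and the output is back to 5x5.

```python
import numpy as np

image = np.arange(25, dtype=float).reshape(5, 5)
# Surround the 5x5 image with one ring of 0s -> 7x7
# (np.pad defaults to constant-zero padding).
padded = np.pad(image, pad_width=1)
kernel = np.ones((3, 3))

# Same "valid" sliding as before, now over the padded image.
k = kernel.shape[0]
out = np.empty((padded.shape[0] - k + 1, padded.shape[1] - k + 1))
for i in range(out.shape[0]):
    for j in range(out.shape[1]):
        out[i, j] = np.sum(padded[i:i + k, j:j + k] * kernel)

print(out.shape)  # (5, 5): zero padding preserves the input size
```

The pad width needed to preserve the size is (k-1)/2 per side for an odd kxk kernel, so a 5x5 kernel would need two rings of zeros.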