Below, I use channels and feature maps interchangeably.
I'm trying to better understand how 1x1 convolution works with multiple input channels and have yet to find a good explanation of it. Before getting into 1x1, I'd like to make sure I understand 2D vs. 3D convolution. Let's look at a simple example of 2D convolution in the Keras API:
from tensorflow.keras.layers import Input, Conv2D

i = Input(shape=(64, 64, 3))
x = Conv2D(filters=32, kernel_size=(3, 3), padding='same', activation='relu')(i)
In the above example, the input image has 3 channels and the convolutional layer will produce 32 feature maps. Will the 2D convolutional layer apply a different kernel to each of the 3 input channels and sum the results to generate each feature map? If so, the number of kernels used in a 2D convolution = #input channels * #feature maps. In this case, 96 different 3x3 kernels would be used to produce the 32 feature maps.
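One way to check this (a quick sketch, assuming TensorFlow 2.x / tf.keras) is to inspect the layer's weights: Keras stores them as a single tensor whose last two dimensions are input channels and filters, which matches the 96 kernel slices counted above.

from tensorflow.keras.layers import Input, Conv2D
from tensorflow.keras.models import Model

i = Input(shape=(64, 64, 3))
x = Conv2D(filters=32, kernel_size=(3, 3), padding='same', activation='relu')(i)
m = Model(i, x)

kernel, bias = m.layers[1].get_weights()
print(kernel.shape)  # (3, 3, 3, 32): 32 filters, each spanning all 3 input channels
print(bias.shape)    # (32,)
# 3 input channels * 32 filters = 96 distinct 3x3 kernel slices in total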
Now let's look at 3D convolution:
from tensorflow.keras.layers import Input, Conv3D

i = Input(shape=(1, 64, 64, 3))
x = Conv3D(filters=32, kernel_size=(3, 3, 3), padding='same', activation='relu')(i)
In the above example, based on my current understanding, each kernel is convolved with all input channels simultaneously. Therefore, the number of kernels used in a 3D convolution = #feature maps, since each kernel already spans every input channel. In this case, 32 different kernels would be used to produce 32 feature maps.
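Again, a quick check (a sketch, assuming TensorFlow 2.x / tf.keras): the Conv3D weights form one five-dimensional tensor, with one 3x3x3 volume per input channel per filter.

from tensorflow.keras.layers import Input, Conv3D
from tensorflow.keras.models import Model

i = Input(shape=(1, 64, 64, 3))
x = Conv3D(filters=32, kernel_size=(3, 3, 3), padding='same', activation='relu')(i)
m = Model(i, x)

kernel, bias = m.layers[1].get_weights()
print(kernel.shape)  # (3, 3, 3, 3, 32): each of the 32 filters spans all 3 input channels
print(bias.shape)    # (32,)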
I understand the purpose of downsampling channels before computations with bigger kernels (3x3, 5x5, 7x7). I'm asking because I'm confused as to how 1x1 convolutions preserve learned features. Let's look at a 1x1 convolution:
from tensorflow.keras.layers import Input, Conv2D

i = Input(shape=(64, 64, 3))
x = Conv2D(filters=32, kernel_size=(3, 3), padding='same', activation='relu')(i)
x = Conv2D(filters=8, kernel_size=(1, 1), padding='same', activation='relu')(x)
If my above understanding of 2D convolutions is correct, then the 1x1 convolutional layer will use 32 different kernels (each a single weight) to generate each feature map. This operation would use a total of 256 weights (32*8) to generate 8 feature maps. Each output computation essentially combines the 32 channel values at one spatial position into a single value. How does this one value somehow retain all of the features from the previous 32 channels?
Answer:
A 1x1 convolution is a 2D convolution, just with a kernel size of 1. Since a 1x1 kernel has no spatial neighborhood, unlike a 3x3 kernel, it cannot mix information across pixels; at each spatial position it simply computes a learned linear combination of the input channels. So the 8 output channels do not retain all 32 input channels losslessly: the network learns a projection that keeps whichever combinations of features are useful downstream. How a network built on 1x1 convolutions learns spatial features depends on the architecture around them.
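To make the channel mixing concrete, here is a small NumPy sketch (my own illustration, ignoring the bias and activation): a 1x1 convolution from 32 channels down to 8 is just a per-pixel matrix multiplication by a learned (32, 8) weight matrix.

import numpy as np

h, w = 4, 4
x = np.random.rand(h, w, 32)       # feature maps from the previous layer
weights = np.random.rand(32, 8)    # the 1x1 kernels: one 32-vector per output channel

# 1x1 convolution: at every pixel, linearly combine the 32 channel values
out = (x.reshape(-1, 32) @ weights).reshape(h, w, 8)
print(out.shape)                   # (4, 4, 8): same spatial grid, compressed channels

Nothing spatial happens here; each output pixel depends only on the 32 values at the same location.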
By the way, the difference between a 2D convolution and a 3D convolution is the movement of the kernel. A 2D convolution slides the filter along x and y and learns (kernel x kernel x input_channels) parameters per output channel. A 3D convolution slides along x, y, and z and learns (kernel x kernel x kernel x input_channels) parameters per output channel. You could do a 3D convolution on an image with channels, but it doesn't really make sense, because we already know the channel "depth" is correlated. 3D convolutions are generally used on geometric volumes, e.g. data from a CT scan.
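You can verify those parameter counts directly (a sketch, assuming TensorFlow 2.x / tf.keras; the bias adds one parameter per output channel):

from tensorflow.keras.layers import Input, Conv2D, Conv3D
from tensorflow.keras.models import Model

# 2D: kernel x kernel x input_channels parameters per output channel, plus bias
i2 = Input(shape=(64, 64, 3))
m2 = Model(i2, Conv2D(filters=32, kernel_size=(3, 3), padding='same')(i2))
print(m2.count_params())  # 3*3*3*32 + 32 = 896

# 3D: kernel x kernel x kernel x input_channels parameters per output channel, plus bias
i3 = Input(shape=(1, 64, 64, 3))
m3 = Model(i3, Conv3D(filters=32, kernel_size=(3, 3, 3), padding='same')(i3))
print(m3.count_params())  # 3*3*3*3*32 + 32 = 2624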
This link may also be helpful: https://medium.com/analytics-vidhya/talented-mr-1x1-comprehensive-look-at-1x1-convolution-in-deep-learning-f6b355825578