3D convolutional autoencoder is not returning the right output shape

I'm trying to use an autoencoder on spatiotemporal data. My data has shape (batches, filters, timesteps, rows, columns). I'm having trouble getting the autoencoder to produce the right output shape.

This is my model:

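# Imports assumed to come from TensorFlow's bundled Keras
from tensorflow.keras.layers import Input, Conv3D, MaxPooling3D, UpSampling3D
from tensorflow.keras.models import Model

input_imag = Input(shape=(3, 81, 4, 4))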

x = Conv3D(16, (5, 3, 3), data_format='channels_first', activation='relu', padding='same')(input_imag)
x = MaxPooling3D((3, 2, 2), data_format='channels_first', padding='same')(x)
x = Conv3D(8, (5, 3, 3), data_format='channels_first', activation='relu', padding='same')(x)
x = MaxPooling3D((3, 2, 2), data_format='channels_first', padding='same')(x)
x = Conv3D(4, (5, 3, 3), data_format='channels_first', activation='relu', padding='same')(x)
encoded = MaxPooling3D((3, 2, 2), data_format='channels_first', padding='same', name='encoder')(x)

x = Conv3D(4, (5, 3, 3), data_format='channels_first', activation='relu', padding='same')(encoded)
x = UpSampling3D((3, 2, 2), data_format='channels_first')(x)
x = Conv3D(8, (5, 3, 3), data_format='channels_first', activation='relu', padding='same')(x)
x = UpSampling3D((3, 2, 2), data_format='channels_first')(x)
x = Conv3D(16, (5, 3, 3), data_format='channels_first', activation='relu', padding='same')(x)
x = UpSampling3D((3, 2, 2), data_format='channels_first')(x)
decoded = Conv3D(3, (5, 3, 3), data_format='channels_first', activation='relu', padding='same')(x)

autoencoder = Model(input_imag, decoded)
autoencoder.compile(optimizer='adam', loss='mse')

autoencoder.summary()

This is the summary:

Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         [(None, 3, 81, 4, 4)]     0
_________________________________________________________________
conv3d (Conv3D)              (None, 16, 81, 4, 4)      2176
_________________________________________________________________
max_pooling3d (MaxPooling3D) (None, 16, 27, 2, 2)      0
_________________________________________________________________
conv3d_1 (Conv3D)            (None, 8, 27, 2, 2)       5768
_________________________________________________________________
max_pooling3d_1 (MaxPooling3 (None, 8, 9, 1, 1)        0
_________________________________________________________________
conv3d_2 (Conv3D)            (None, 4, 9, 1, 1)        1444
_________________________________________________________________
encoder (MaxPooling3D)       (None, 4, 3, 1, 1)        0
_________________________________________________________________
conv3d_3 (Conv3D)            (None, 4, 3, 1, 1)        724
_________________________________________________________________
up_sampling3d (UpSampling3D) (None, 4, 9, 2, 2)        0
_________________________________________________________________
conv3d_4 (Conv3D)            (None, 8, 9, 2, 2)        1448
_________________________________________________________________
up_sampling3d_1 (UpSampling3 (None, 8, 27, 4, 4)       0
_________________________________________________________________
conv3d_5 (Conv3D)            (None, 16, 27, 4, 4)      5776
_________________________________________________________________
up_sampling3d_2 (UpSampling3 (None, 16, 81, 8, 8)      0
_________________________________________________________________
conv3d_6 (Conv3D)            (None, 3, 81, 8, 8)       2163
=================================================================
Total params: 19,499
Trainable params: 19,499
Non-trainable params: 0

What should I change so that the decoder output shape is [?, 3, 81, 4, 4] instead of [?, 3, 81, 8, 8]?

Answer:

It looks like you want the MaxPooling3D and UpSampling3D operations to be symmetrical (at least in terms of output shapes). Let's look at the input shape of the last MaxPooling3D layer:

conv3d_2 (Conv3D)            (None, 4, 9, 1, 1)        1444
_________________________________________________________________
encoder (MaxPooling3D)       (None, 4, 3, 1, 1)        0

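The shape is (None, 4, 9, 1, 1). The last two dimensions are already 1, so they can't be divided by 2 as pool_size asks. With padding='same', each output dimension is ceil(input / stride), and ceil(1 / 2) = 1, so those dimensions stay at 1. In effect the MaxPooling3D layer, despite having pool_size=(3, 2, 2), behaves like one with pool_size=(3, 1, 1) here. At least I think that is what happens under the hood.

I'm a bit surprised there is no error or warning when specifying a pool_size greater than the input size, although padding='same' pads the input until the window fits, which would explain it.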
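
The shape arithmetic can be checked without building the model. This is a minimal sketch in plain Python, assuming the pooling strides equal pool_size (the Keras default when strides is not given) and padding='same'. The helper name same_pool_shape is made up for illustration and is not part of Keras.

```python
import math

def same_pool_shape(shape, pool_size):
    """Output shape of a pooling layer with padding='same' and strides == pool_size."""
    return tuple(math.ceil(n / p) for n, p in zip(shape, pool_size))

shape = (81, 4, 4)  # (timesteps, rows, columns) from the question
trace = [shape]
for _ in range(3):  # the three MaxPooling3D((3, 2, 2)) layers of the encoder
    shape = same_pool_shape(shape, (3, 2, 2))
    trace.append(shape)

print(trace)
# [(81, 4, 4), (27, 2, 2), (9, 1, 1), (3, 1, 1)]
```

The last step keeps rows and columns at 1 because ceil(1 / 2) = 1, which matches the encoder line in the summary.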

To fix that, you can set the size of the first UpSampling3D layer to (3, 1, 1):

x = UpSampling3D((3, 1, 1), data_format='channels_first')(x)
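
More generally, the upsampling sizes can be derived from the encoder's own shapes rather than guessed. This is a rough sketch, assuming every pooling step shrinks each dimension by an exact integer factor (true for this model, not in general). The function mirrored_upsampling_sizes is a hypothetical helper, not a Keras API.

```python
def mirrored_upsampling_sizes(encoder_shapes):
    """Return UpSampling3D sizes that undo each pooling step, in decoder order."""
    sizes = []
    for before, after in zip(encoder_shapes, encoder_shapes[1:]):
        if any(b % a for b, a in zip(before, after)):
            # Upsampling by an integer factor cannot restore this shape exactly.
            raise ValueError(f"{before} is not an integer multiple of {after}")
        sizes.append(tuple(b // a for b, a in zip(before, after)))
    return sizes[::-1]  # the decoder undoes the last pooling step first

# (timesteps, rows, columns) before and after each MaxPooling3D layer
encoder_shapes = [(81, 4, 4), (27, 2, 2), (9, 1, 1), (3, 1, 1)]
sizes = mirrored_upsampling_sizes(encoder_shapes)
print(sizes)
# [(3, 1, 1), (3, 2, 2), (3, 2, 2)]
```

The first entry is the (3, 1, 1) used for the first UpSampling3D layer, and the remaining two keep the original (3, 2, 2).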

So, the complete solution:

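# Imports assumed to come from TensorFlow's bundled Keras
from tensorflow.keras.layers import Input, Conv3D, MaxPooling3D, UpSampling3D
from tensorflow.keras.models import Model

input_imag = Input(shape=(3, 81, 4, 4))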

x = Conv3D(16, (5, 3, 3), data_format='channels_first', activation='relu', padding='same')(input_imag)
x = MaxPooling3D((3, 2, 2), data_format='channels_first', padding='same')(x)
x = Conv3D(8, (5, 3, 3), data_format='channels_first', activation='relu', padding='same')(x)
x = MaxPooling3D((3, 2, 2), data_format='channels_first', padding='same')(x)
x = Conv3D(4, (5, 3, 3), data_format='channels_first', activation='relu', padding='same')(x)
encoded = MaxPooling3D((3, 2, 2), data_format='channels_first', padding='same', name='encoder')(x)

x = Conv3D(4, (5, 3, 3), data_format='channels_first', activation='relu', padding='same')(encoded)
x = UpSampling3D((3, 1, 1), data_format='channels_first')(x)
x = Conv3D(8, (5, 3, 3), data_format='channels_first', activation='relu', padding='same')(x)
x = UpSampling3D((3, 2, 2), data_format='channels_first')(x)
x = Conv3D(16, (5, 3, 3), data_format='channels_first', activation='relu', padding='same')(x)
x = UpSampling3D((3, 2, 2), data_format='channels_first')(x)
decoded = Conv3D(3, (5, 3, 3), data_format='channels_first', activation='relu', padding='same')(x)

autoencoder = Model(input_imag, decoded)
autoencoder.compile(optimizer='adam', loss='mse')

autoencoder.summary()

Output:

Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_3 (InputLayer)        [(None, 3, 81, 4, 4)]     0         
                                                                 
 conv3d_14 (Conv3D)          (None, 16, 81, 4, 4)      2176      
                                                                 
 max_pooling3d_4 (MaxPooling  (None, 16, 27, 2, 2)     0         
 3D)                                                             
                                                                 
 conv3d_15 (Conv3D)          (None, 8, 27, 2, 2)       5768      
                                                                 
 max_pooling3d_5 (MaxPooling  (None, 8, 9, 1, 1)       0         
 3D)                                                             
                                                                 
 conv3d_16 (Conv3D)          (None, 4, 9, 1, 1)        1444      
                                                                 
 encoder (MaxPooling3D)      (None, 4, 3, 1, 1)        0         
                                                                 
 conv3d_17 (Conv3D)          (None, 4, 3, 1, 1)        724       
                                                                 
 up_sampling3d_6 (UpSampling  (None, 4, 9, 1, 1)       0         
 3D)                                                             
                                                                 
 conv3d_18 (Conv3D)          (None, 8, 9, 1, 1)        1448      
                                                                 
 up_sampling3d_7 (UpSampling  (None, 8, 27, 2, 2)      0         
 3D)                                                             
                                                                 
 conv3d_19 (Conv3D)          (None, 16, 27, 2, 2)      5776      
                                                                 
 up_sampling3d_8 (UpSampling  (None, 16, 81, 4, 4)     0         
 3D)                                                             
                                                                 
 conv3d_20 (Conv3D)          (None, 3, 81, 4, 4)       2163      
                                                                 
=================================================================
Total params: 19,499
Trainable params: 19,499
Non-trainable params: 0
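
As a quick sanity check of the two decoders, the upsampling steps can be traced by hand. This sketch assumes UpSampling3D multiplies each dimension by its size factor and that the Conv3D layers with padding='same' leave the spatial shape unchanged, as the summaries show. The helper names are illustrative only.

```python
def upsample(shape, size):
    """Shape after an UpSampling3D layer: each dimension is multiplied by its factor."""
    return tuple(n * s for n, s in zip(shape, size))

def decoder_output(bottleneck, sizes):
    """Apply the upsampling steps in order, starting from the encoder output."""
    shape = bottleneck
    for size in sizes:
        shape = upsample(shape, size)
    return shape

bottleneck = (3, 1, 1)  # (timesteps, rows, columns) at the 'encoder' layer

original = decoder_output(bottleneck, [(3, 2, 2), (3, 2, 2), (3, 2, 2)])
fixed = decoder_output(bottleneck, [(3, 1, 1), (3, 2, 2), (3, 2, 2)])

print(original)  # (81, 8, 8)
print(fixed)     # (81, 4, 4)
```

The original decoder doubles rows and columns three times, while the encoder only halved them twice, which is where the extra factor of 2 comes from.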