Is there a way to define an auto-encoder with different input, output shapes?


All auto-encoders that I have seen have the same input and output shapes. However, I need an auto-encoder with inputs of shape (None, 32, 16, 3) [RGB images] and outputs of shape (None, 16, 16, 6) [a one-hot encoded representation of the image]. I have tried to adapt the Keras example for the MNIST dataset to my use case, but I am getting the following error:

ValueError: `logits` and `labels` must have the same shape, received ((None, 32, 16, 6) vs (None, 16, 16, 6)).

Below is my auto-encoder architecture:

from tensorflow.keras import layers, Model

input = layers.Input(shape=(32, 16, 3))

# Encoder
x = layers.Conv2D(32, (3, 3), activation="relu", padding="same")(input)
x = layers.MaxPooling2D((2, 2), padding="same")(x)
x = layers.Conv2D(32, (3, 3), activation="relu", padding="same")(x)
x = layers.MaxPooling2D((2, 2), padding="same")(x)

# Decoder
x = layers.Conv2DTranspose(32, (3, 3), strides=2, activation="relu", padding="same")(x)
x = layers.Conv2DTranspose(32, (3, 3), strides=2, activation="relu", padding="same")(x)
x = layers.Conv2D(6, (3, 3), activation="softmax", padding="same")(x)

# Autoencoder
autoencoder = Model(input, x)
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
autoencoder.summary()

All I care about right now is to make it work without causing errors, and then I will proceed to fix other stuff like the loss function, etc. Is there any way to make this work, or should I just go for other deep learning architectures?

CodePudding user response:

Theoretically, you can build your model exactly as you describe it and end up with an output shape that differs from the input shape. You only have to keep in mind that your input data is then no longer suitable as the training target: the target must have the same shape as the network's output.

According to the error message, that is not the case here. The model has an output shape of (None, 32, 16, 6), but the target data has the shape (None, 16, 16, 6).
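To see where the mismatch comes from, it helps to trace the spatial dimensions through your architecture: two 2x2 poolings take 32x16 down to 8x4, and two stride-2 transposed convolutions bring it back up to 32x16 rather than the desired 16x16. A minimal trace of your layers with the resulting shapes as comments (assuming `from tensorflow.keras import layers, Model`):

from tensorflow.keras import layers, Model

inp = layers.Input(shape=(32, 16, 3))                                                     # (None, 32, 16, 3)
x = layers.Conv2D(32, (3, 3), activation="relu", padding="same")(inp)                     # (None, 32, 16, 32)
x = layers.MaxPooling2D((2, 2), padding="same")(x)                                        # (None, 16, 8, 32)
x = layers.Conv2D(32, (3, 3), activation="relu", padding="same")(x)                       # (None, 16, 8, 32)
x = layers.MaxPooling2D((2, 2), padding="same")(x)                                        # (None, 8, 4, 32)
x = layers.Conv2DTranspose(32, (3, 3), strides=2, activation="relu", padding="same")(x)   # (None, 16, 8, 32)
x = layers.Conv2DTranspose(32, (3, 3), strides=2, activation="relu", padding="same")(x)   # (None, 32, 16, 32)
out = layers.Conv2D(6, (3, 3), activation="softmax", padding="same")(x)                   # (None, 32, 16, 6)

Model(inp, out).summary()  # final shape is (None, 32, 16, 6), exactly as the error message reports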

To solve this, the network or its layers would have to be adjusted so that the two shapes match. The output of autoencoder.summary() shows the final shape very clearly. For example, a possible network with the right output shape looks like this:

input = layers.Input(shape=(32, 16, 3))
# Encoder
x = layers.Conv2D(32, (3, 3), activation="relu", padding="same")(input)
x = layers.MaxPooling2D((2, 2), padding="same")(x)
x = layers.Conv2D(32, (3, 3), activation="relu", padding="same")(x)
x = layers.MaxPooling2D((4, 2), padding="same")(x)

# Decoder
x = layers.Conv2DTranspose(32, (3, 3), strides=2, activation="relu", padding="same")(x)
x = layers.Conv2DTranspose(32, (3, 3), strides=2, activation="relu", padding="same")(x)
x = layers.Conv2D(6, (3, 3), activation="softmax", padding="same")(x)

# Autoencoder
autoencoder = Model(input, x)

autoencoder.summary()
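Since the output layer is a softmax over 6 channels and your targets are one-hot encoded, categorical cross-entropy is a more natural loss than binary cross-entropy once the shapes line up. A minimal sketch of compiling and fitting the adjusted model; the x_train/y_train data here is randomly generated just to check that the shapes work, not from the question:

import numpy as np
from tensorflow.keras.utils import to_categorical

# Hypothetical data with the shapes from the question:
x_train = np.random.rand(8, 32, 16, 3).astype("float32")                              # RGB inputs, (8, 32, 16, 3)
y_train = to_categorical(np.random.randint(0, 6, size=(8, 16, 16)), num_classes=6)    # one-hot targets, (8, 16, 16, 6)

autoencoder.compile(optimizer="adam", loss="categorical_crossentropy")
autoencoder.fit(x_train, y_train, epochs=1, batch_size=4)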