I am fairly new to deep learning and image classification. I want to extract features from an image using VGG16 and feed them as input to my vit-keras model. Here is my code:
from tensorflow.keras.applications.vgg16 import VGG16

vgg_model = VGG16(include_top=False, weights='imagenet', input_shape=(IMAGE_SIZE, IMAGE_SIZE, 3))
for layer in vgg_model.layers:
    layer.trainable = False

from vit_keras import vit

vit_model = vit.vit_b16(
    image_size=IMAGE_SIZE,
    activation='sigmoid',
    pretrained=True,
    include_top=False,
    pretrained_top=False,
    classes=2)

model = tf.keras.Sequential([
    vgg_model,
    vit_model,
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(512, activation=tfa.activations.gelu),
    tf.keras.layers.Dense(256, activation=tfa.activations.gelu),
    tf.keras.layers.Dense(64, activation=tfa.activations.gelu),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(1, activation='sigmoid')
], name='vision_transformer')

model.summary()
But I'm getting the following error:
ValueError: Input 0 of layer embedding is incompatible with the layer: expected axis -1 of input shape to have value 3 but received input with shape (None, 8, 8, 512)
I'm assuming this error occurs where VGG16 and the vit-keras model are joined. How can I rectify this error?
CodePudding user response:
You cannot feed the output of the VGG16 model into the vit_model, since both models expect an input of shape (224, 224, 3) (or whatever input shape you defined), while the VGG16 model outputs a feature map of shape (8, 8, 512). You could try upsampling / reshaping / resizing that output to fit the expected shape, but I would not recommend it. Instead, feed the same input to both models and concatenate their outputs afterwards. Here is a working example:
import tensorflow as tf
import tensorflow_addons as tfa
from vit_keras import vit

IMAGE_SIZE = 224

# Frozen VGG16 backbone
vgg_model = tf.keras.applications.vgg16.VGG16(include_top=False, weights='imagenet', input_shape=(IMAGE_SIZE, IMAGE_SIZE, 3))
for layer in vgg_model.layers:
    layer.trainable = False

vit_model = vit.vit_b16(
    image_size=IMAGE_SIZE,
    activation='sigmoid',
    pretrained=True,
    include_top=False,
    pretrained_top=False,
    classes=2)

# Run both backbones on the same input and concatenate their features
inputs = tf.keras.layers.Input((IMAGE_SIZE, IMAGE_SIZE, 3))
vgg_output = tf.keras.layers.Flatten()(vgg_model(inputs))
vit_output = vit_model(inputs)
x = tf.keras.layers.Concatenate(axis=-1)([vgg_output, vit_output])
x = tf.keras.layers.Dense(512, activation=tfa.activations.gelu)(x)
x = tf.keras.layers.Dense(256, activation=tfa.activations.gelu)(x)
x = tf.keras.layers.Dense(64, activation=tfa.activations.gelu)(x)
x = tf.keras.layers.BatchNormalization()(x)
outputs = tf.keras.layers.Dense(1, activation='sigmoid')(x)
model = tf.keras.Model(inputs, outputs)
model.summary()
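For completeness, the upsampling / resizing route mentioned above would look roughly like the sketch below: a hypothetical adapter that collapses the 512 VGG16 channels down to 3 with a 1x1 convolution and resizes the 8x8 feature map back up to the image size the ViT expects. The adapter itself is my own illustrative construction, not part of either library, and this is exactly the approach I would not recommend, since the upsampled tensor is no longer a real image.

```python
import tensorflow as tf

IMAGE_SIZE = 224

# Hypothetical adapter from a VGG16 feature map of shape (8, 8, 512)
# to an image-like tensor of shape (IMAGE_SIZE, IMAGE_SIZE, 3).
adapter = tf.keras.Sequential([
    tf.keras.layers.Input((8, 8, 512)),
    tf.keras.layers.Conv2D(3, kernel_size=1),         # collapse 512 channels to 3
    tf.keras.layers.Resizing(IMAGE_SIZE, IMAGE_SIZE)  # upsample 8x8 to 224x224
])

features = tf.random.normal((1, 8, 8, 512))  # stand-in for a VGG16 output batch
print(adapter(features).shape)  # (1, 224, 224, 3)
```

The output now has the shape the ViT embedding layer expects, but the pretrained ViT weights were trained on actual RGB images, which is why concatenating the two feature streams is the better option.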