I am using a CLIP model, where I have two models. One model's output is (20, 128, 256) and the other one's output is (20, 256):
image_model_output = (20, 256)
text_model_output = (20, 128, 256)
I use the following to calculate the logits:
logits = tf.matmul(caption_embeddings, image_embeddings, transpose_b=True)
so it is effectively `(20, 128, 256) * (256, 20)`, with the 2-D tensor broadcast across the batch, and its output will be `(20, 128, 20)`.
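As a sanity check (a sketch with dummy tensors standing in for the real embeddings, so the values here are placeholders), tf.matmul does broadcast the 2-D tensor across the batch dimension:

import tensorflow as tf

caption_embeddings = tf.random.normal((20, 128, 256))  # text model output
image_embeddings = tf.random.normal((20, 256))         # image model output

# transpose_b turns the image tensor into (256, 20), which is then
# multiplied against each of the 20 caption matrices of shape (128, 256)
logits = tf.matmul(caption_embeddings, image_embeddings, transpose_b=True)
print(logits.shape)  # (20, 128, 20)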
Similarly, I calculate the image-to-image similarity like this:
images_similarity = tf.matmul(
image_embeddings, image_embeddings, transpose_b=True
)
(Output) --> (20, 256) * (256, 20) = (20, 20)
and the caption-to-caption similarity like this:
captions_similarity = tf.matmul(
caption_embeddings, caption_embeddings, transpose_b=True
)
(Output) --> (20, 128, 256) * (20, 256, 128) = (20, 128, 128), since transpose_b swaps only the last two axes
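To double-check with dummy tensors (again a sketch, not the original code), both self-similarity shapes come out as described:

import tensorflow as tf

image_embeddings = tf.random.normal((20, 256))
caption_embeddings = tf.random.normal((20, 128, 256))

# (20, 256) x (256, 20) -> (20, 20)
images_similarity = tf.matmul(image_embeddings, image_embeddings, transpose_b=True)
# (20, 128, 256) x (20, 256, 128) -> (20, 128, 128)
captions_similarity = tf.matmul(caption_embeddings, caption_embeddings, transpose_b=True)
print(images_similarity.shape, captions_similarity.shape)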
The problem arises here:
targets = keras.activations.softmax(
    (captions_similarity + images_similarity) / (2 * self.temperature)
)
So do I need to change the activation function, or is there any way to add these 3-D matrices with different shapes? Sorry for explaining it this technically, but people with a solid deep learning and machine learning background will understand.
NOTE: After adding an axis at position 1, like this:
tf.expand_dims(image_embeddings, axis=1)
the part below runs successfully:
targets = keras.activations.softmax(
    (captions_similarity + images_similarity) / (2 * self.temperature)
)
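Here is a minimal sketch of why the expand_dims makes this work, and of the side effect it has on the logits (the temperature value below is a placeholder, not taken from the original code):

import tensorflow as tf
from tensorflow import keras

temperature = 0.05  # placeholder value

caption_embeddings = tf.random.normal((20, 128, 256))
image_embeddings = tf.expand_dims(tf.random.normal((20, 256)), axis=1)  # (20, 1, 256)

# (20, 1, 256) x (20, 256, 1) -> (20, 1, 1), which broadcasts against (20, 128, 128)
images_similarity = tf.matmul(image_embeddings, image_embeddings, transpose_b=True)
captions_similarity = tf.matmul(caption_embeddings, caption_embeddings, transpose_b=True)
targets = keras.activations.softmax(
    (captions_similarity + images_similarity) / (2 * temperature)
)
print(targets.shape)  # (20, 128, 128)

# Note, however, that the logits shape also changes:
logits = tf.matmul(caption_embeddings, image_embeddings, transpose_b=True)
print(logits.shape)  # (20, 128, 1) -- no longer matches targets, hence the error below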
However, after this there is a loss function like the one below:
captions_loss = keras.losses.categorical_crossentropy(
y_true=targets, y_pred=logits, from_logits=True
)
which generates this error
ValueError: Shapes (2, 128, 128) and (2, 128, 1) are incompatible
Is it possible to solve this error?
CodePudding user response:
To handle the above error I used a different loss function: categorical_crossentropy reduces over the last axis, so y_true and y_pred must have matching shapes, while here they are (2, 128, 128) and (2, 128, 1). I changed the code from:
captions_loss = keras.losses.categorical_crossentropy(
y_true=targets, y_pred=logits, from_logits=True
)
to:
captions_loss = keras.losses.kl_divergence(
y_true=targets, y_pred=logits
)
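One caveat worth flagging (an observation about the Keras API, not part of the original code): keras.losses.kl_divergence has no from_logits argument and assumes both y_true and y_pred are probability distributions over the last axis, so raw logits may need a softmax first. A minimal illustration:

import tensorflow as tf
from tensorflow import keras

y_true = tf.constant([[0.7, 0.2, 0.1]])                             # already a distribution
y_pred = keras.activations.softmax(tf.constant([[2.0, 1.0, 0.5]]))  # logits -> distribution
print(keras.losses.kl_divergence(y_true, y_pred))                   # per-sample KL divergence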
To save other developers' time, I have answered my own question. I am available to discuss it further if anyone is interested.