I'm working on an NLP project to detect the emotion of a text. I'm using this dataset: https://www.kaggle.com/datasets/praveengovi/emotions-dataset-for-nlp?select=train.txt
I keep getting this error: logits and labels must have the same first dimension, got logits shape [16,6] and labels shape [96]
My batch size is 16, so the labels shape seems right to me, since I one-hot encoded the outputs and there are 6 possible classes (6*16 = 96). For some reason, the network is changing the shape of the labels and I don't know where this is happening.
Here is my code:
import numpy as np
import pandas as pd
import tensorflow as tf
import os
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from sklearn import preprocessing
from keras.utils.np_utils import to_categorical
from tensorflow.keras import layers
from tensorflow.keras import losses
training_size = 14000
val_size = 1000
BATCH_SIZE = 16
with open('/content/drive/MyDrive/KaggleDatasets/train.txt') as f:
    contents = f.readlines()
split_txt = []
for i in range(len(contents)):
    split_txt.append(contents[i].split(';'))
sentences = []
emotions = []
for i in range(len(contents)):
    sentences.append(split_txt[i][0])
    emotions.append(split_txt[i][1])
labels = np.array(emotions)
labels = labels.astype('str')
unique_labels = np.unique(labels)
print(unique_labels)
label_dict = {
    'anger\n': 0,
    'fear\n': 1,
    'joy\n': 2,
    'love\n': 3,
    'sadness\n': 4,
    'surprise\n': 5
}
#get labels from string to int
int_labels = []
for i in range(len(labels)):
    int_labels.append(label_dict[labels[i]])
categorical_labels = np.array(to_categorical(int_labels, num_classes=len(unique_labels)))
sentences=np.array(sentences)
x_train = sentences[0:training_size]
x_val = sentences[training_size:training_size+val_size]
x_test = sentences[training_size+val_size:]
y_train = categorical_labels[0:training_size]
y_val = categorical_labels[training_size:training_size+val_size]
y_test = categorical_labels[training_size+val_size:]
tokenizer = Tokenizer(num_words=500, oov_token="<OOV>")
tokenizer.fit_on_texts(x_train)
word_index = tokenizer.word_index
training_sequences = tokenizer.texts_to_sequences(x_train)
training_padded = pad_sequences(training_sequences, padding='post')
val_sequences = tokenizer.texts_to_sequences(x_val)
val_padded = pad_sequences(val_sequences, padding='post')
test_sequences = tokenizer.texts_to_sequences(x_test)
test_padded = pad_sequences(test_sequences, padding='post')
train_ds = tf.data.Dataset.from_tensor_slices((training_padded, y_train))
val_ds = tf.data.Dataset.from_tensor_slices((val_padded, y_val))
test_ds = tf.data.Dataset.from_tensor_slices((test_padded, y_test))
AUTOTUNE = tf.data.AUTOTUNE
train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)
test_ds = test_ds.cache().prefetch(buffer_size=AUTOTUNE)
train_ds = train_ds.batch(batch_size=BATCH_SIZE)
val_ds = val_ds.batch(batch_size=BATCH_SIZE)
test_ds = test_ds.batch(batch_size=BATCH_SIZE)
vocab_size = len(word_index)
embed_dim = 32
max_length = training_padded.shape[1]
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embed_dim, input_length=max_length),
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(20, activation='relu'),
    tf.keras.layers.Dense(6, activation='softmax')
])
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
callbacks = [
    tf.keras.callbacks.ReduceLROnPlateau(monitor='loss', patience=2, verbose=1),
    tf.keras.callbacks.EarlyStopping(monitor='loss', patience=5, verbose=1),
]
epochs=3
history = model.fit(
    train_ds,
    epochs=epochs,
    validation_data=val_ds,
    callbacks=callbacks
)
I'm getting the error here, during the loss function calculation in model.fit.
CodePudding user response:
In this case, your y_true and y_pred shapes need to be compatible with the loss you chose. SparseCategoricalCrossentropy expects integer class indices of shape [batch_size] (here [16]), not one-hot vectors, which is why your [16, 6] one-hot labels get flattened to [96] and no longer match the logits' first dimension.
Please refer to the documentation below to understand the same.
https://www.tensorflow.org/api_docs/python/tf/keras/losses/SparseCategoricalCrossentropy
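As a minimal standalone sketch of what each loss expects (random data in place of your dataset):

import numpy as np
import tensorflow as tf

batch_size, num_classes = 16, 6
preds = tf.nn.softmax(tf.random.normal((batch_size, num_classes)))  # like the model output, shape [16, 6]

int_labels = np.random.randint(0, num_classes, size=(batch_size,))  # shape [16]
onehot_labels = tf.one_hot(int_labels, depth=num_classes)           # shape [16, 6]

# SparseCategoricalCrossentropy wants integer class indices of shape [batch]
sparse_cce = tf.keras.losses.SparseCategoricalCrossentropy()
print(sparse_cce(int_labels, preds).numpy())    # works

# CategoricalCrossentropy wants one-hot labels of shape [batch, classes]
cce = tf.keras.losses.CategoricalCrossentropy()
print(cce(onehot_labels, preds).numpy())        # works

# Feeding one-hot labels to the sparse loss reproduces your error:
# sparse_cce(onehot_labels, preds)  # -> logits and labels must have the same first dimension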
CodePudding user response:
You need to use the CategoricalCrossentropy loss instead of SparseCategoricalCrossentropy, because your labels are one-hot encoded rather than integer class indices.
Also, your padded validation sequences have a different length to your train sequences. You can use the maxlen parameter to make them equal:
val_padded = pad_sequences(val_sequences, padding='post', maxlen=training_padded.shape[-1])
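Putting it together, a minimal sketch of the remaining changes (your test sequences need the same treatment, and since the model already ends in a softmax layer, from_logits should be False once you switch the loss):

# pad the test sequences to the training length as well
test_padded = pad_sequences(test_sequences, padding='post', maxlen=training_padded.shape[-1])

# one-hot labels -> CategoricalCrossentropy; softmax output -> from_logits=False
model.compile(optimizer='adam',
              loss=tf.keras.losses.CategoricalCrossentropy(from_logits=False),
              metrics=['accuracy'])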