I'm working on an NLP project to detect the emotion of a text. I'm using this dataset: https://www.kaggle.com/datasets/praveengovi/emotions-dataset-for-nlp?select=train.txt
I keep getting this error: logits and labels must have the same first dimension, got logits shape [16,6] and labels shape [96]
My batch size is 16, so the labels shape seems right to me, since I one-hot encoded the outputs and there are 6 possible classes (6*16 = 96). For some reason, the network is changing the shape of the labels and I don't know where this is happening.
Here is my code:
import numpy as np
import pandas as pd
import tensorflow as tf
import os
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from sklearn import preprocessing
from keras.utils.np_utils import to_categorical
from tensorflow.keras import layers
from tensorflow.keras import losses
training_size = 14000
val_size = 1000
BATCH_SIZE = 16
with open('/content/drive/MyDrive/KaggleDatasets/train.txt') as f:
    contents = f.readlines()
split_txt = []
for i in range(len(contents)):
    split_txt.append(contents[i].split(';'))
sentences = []
emotions = []
for i in range(len(contents)):
    sentences.append(split_txt[i][0])
    emotions.append(split_txt[i][1])
labels = np.array(emotions)
labels = labels.astype('str')
unique_labels = np.unique(labels)
print(unique_labels)
label_dict = {
    'anger\n': 0,
    'fear\n': 1,
    'joy\n': 2,
    'love\n': 3,
    'sadness\n': 4,
    'surprise\n': 5
}
#get labels from string to int
int_labels = []
for i in range(len(labels)):
    int_labels.append(label_dict[labels[i]])
categorical_labels = np.array(to_categorical(int_labels, num_classes=len(unique_labels)))
sentences=np.array(sentences)
x_train = sentences[0:training_size]
x_val = sentences[training_size:training_size+val_size]
x_test = sentences[training_size+val_size:]
y_train = categorical_labels[0:training_size]
y_val = categorical_labels[training_size:training_size+val_size]
y_test = categorical_labels[training_size+val_size:]
tokenizer = Tokenizer(num_words=500, oov_token="<OOV>")
tokenizer.fit_on_texts(x_train)
word_index = tokenizer.word_index
training_sequences = tokenizer.texts_to_sequences(x_train)
training_padded = pad_sequences(training_sequences, padding='post')
val_sequences = tokenizer.texts_to_sequences(x_val)
val_padded = pad_sequences(val_sequences, padding='post')
test_sequences = tokenizer.texts_to_sequences(x_test)
test_padded = pad_sequences(test_sequences, padding='post')
train_ds = tf.data.Dataset.from_tensor_slices((training_padded, y_train))
val_ds = tf.data.Dataset.from_tensor_slices((val_padded, y_val))
test_ds = tf.data.Dataset.from_tensor_slices((test_padded, y_test))
AUTOTUNE = tf.data.AUTOTUNE
train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)
test_ds = test_ds.cache().prefetch(buffer_size=AUTOTUNE)
train_ds = train_ds.batch(batch_size=BATCH_SIZE)
val_ds = val_ds.batch(batch_size=BATCH_SIZE)
test_ds = test_ds.batch(batch_size=BATCH_SIZE)
vocab_size = len(word_index)
embed_dim = 32
max_length = training_padded.shape[1]
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embed_dim, input_length=max_length),
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(20, activation='relu'),
    tf.keras.layers.Dense(6, activation='softmax')
])
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
callbacks = [
    tf.keras.callbacks.ReduceLROnPlateau(monitor='loss', patience=2, verbose=1),
    tf.keras.callbacks.EarlyStopping(monitor='loss', patience=5, verbose=1),
]
epochs=3
history = model.fit(
    train_ds,
    epochs=epochs,
    validation_data=val_ds,
    callbacks=callbacks
)
I'm getting the error here, during the loss function calculation in model.fit.
CodePudding user response:
In this case, your y_true and y_pred shapes need to be compatible with the loss you chose. SparseCategoricalCrossentropy expects integer class indices of shape [batch_size] (here [16]), not one-hot vectors, which is why your [16, 6] one-hot labels get flattened to [96] and no longer match the logits' first dimension.
Please refer to the documentation below to understand the same.
https://www.tensorflow.org/api_docs/python/tf/keras/losses/SparseCategoricalCrossentropy
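As a minimal standalone sketch of what each loss expects (random data in place of your dataset):

import numpy as np
import tensorflow as tf

batch_size, num_classes = 16, 6
preds = tf.nn.softmax(tf.random.normal((batch_size, num_classes)))  # like the model output, shape [16, 6]

int_labels = np.random.randint(0, num_classes, size=(batch_size,))  # shape [16]
onehot_labels = tf.one_hot(int_labels, depth=num_classes)           # shape [16, 6]

# SparseCategoricalCrossentropy wants integer class indices of shape [batch]
sparse_cce = tf.keras.losses.SparseCategoricalCrossentropy()
print(sparse_cce(int_labels, preds).numpy())    # works

# CategoricalCrossentropy wants one-hot labels of shape [batch, classes]
cce = tf.keras.losses.CategoricalCrossentropy()
print(cce(onehot_labels, preds).numpy())        # works

# Feeding one-hot labels to the sparse loss reproduces your error:
# sparse_cce(onehot_labels, preds)  # -> logits and labels must have the same first dimension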
CodePudding user response:
You need to use the CategoricalCrossentropy loss instead of SparseCategoricalCrossentropy, because your labels are one-hot encoded rather than integer class indices.
Also, your padded validation sequences have a different length to your train sequences. You can use the maxlen parameter to make them equal:
val_padded = pad_sequences(val_sequences, padding='post', maxlen=training_padded.shape[-1])
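Putting it together, a minimal sketch of the remaining changes (your test sequences need the same treatment, and since the model already ends in a softmax layer, from_logits should be False once you switch the loss):

# pad the test sequences to the training length as well
test_padded = pad_sequences(test_sequences, padding='post', maxlen=training_padded.shape[-1])

# one-hot labels -> CategoricalCrossentropy; softmax output -> from_logits=False
model.compile(optimizer='adam',
              loss=tf.keras.losses.CategoricalCrossentropy(from_logits=False),
              metrics=['accuracy'])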