Keras RNN, incorrect shape of input even though shape is shown as correct


I am attempting to train an RNN to classify texts. On my computer I have a large text file with all of the phrases to train the network on for each category (2 total), e.g.

Phrase 1
Phrase 2
Phrase 3

I then turned that into a Keras dataset using

tf.data.TextLineDataset(directory)

This had no labels attached to the items, so I used the function

directory.map(lambda ex: labeler(ex, 2))

which added labels to all of the items, leaving a dataset which looked like this:

<MapDataset element_spec=(TensorSpec(shape=(), dtype=tf.string, name=None), TensorSpec(shape=(), dtype=tf.int64, name=None))>

This was then split into validation and training sets using .skip and .take. The two categories were then combined into a single validation set and a single training set using category1 = category1.concatenate(category2)

I then created a vectorization layer which looked like this:

def vectorize_text(text, label):
  # expand the scalar string into a batch of one so the layer returns (1, 250)
  text = tf.expand_dims(text, -1)

  return vectorize_layer(text), label

and ran the training and validation sets through the function to vectorize all of the phrases. This left a dataset which looked like this:

<MapDataset element_spec=(TensorSpec(shape=(None, 250), dtype=tf.int64, name=None), TensorSpec(shape=(), dtype=tf.int64, name=None))>

and an example of an item would be

(<tf.Tensor: shape=(1, 250), dtype=int64, numpy=array([[   1,   28,   12, 1199, 3445,   61,   31,  166,  163,   13,   28,
           2,   97,   13,    6,  206,  625,  972,  344,    7, 2790,   11,
           1, 1379, 3615,   24,    1,    2,   27,   21,    3,  435,    4,
          16,    1,   15,   22,    1,    3,  127,    2,   13,   36,    8,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0]], dtype=int64)>, <tf.Tensor: shape=(), dtype=int64, numpy=2>),

As you can see, the shape of the item is (1, 250), and there is a second tensor which represents the category it falls into, in this case 2. I then fed it into my model, where things broke. The model was this:

model = keras.Sequential()
model.add(keras.layers.LSTM(128, input_shape=(1,250,), activation="relu", return_sequences=True))
model.add(keras.layers.LSTM(128, activation="relu", return_sequences=True))
model.add(keras.layers.LSTM(32, activation="relu", return_sequences=False))
model.add(keras.layers.Dense(1,activation="relu"))

model.compile(optimizer=tf.keras.optimizers.Adam(0.01), loss='binary_crossentropy', metrics=['accuracy','Precision','Recall'])
model.fit(train_set,batch_size=32,epochs=1)

but when I ran the code, I got the error

 ValueError: Input 0 of layer "sequential_54" is incompatible with the layer: expected shape=(None, 1, 250), found shape=(None, 250)

To solve this I tried adding a Reshape layer, which didn't work. I also tried np.expand_dims, but that couldn't solve the problem either. Does anyone have a solution? Also, some attribute accesses such as train_set.shape give errors like ConcatenateDataset object has no attribute 'shape'.
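(For reference, tf.data.Dataset objects expose per-element shapes through element_spec rather than a .shape attribute; a minimal sketch of inspecting them:)

print(train_set.element_spec)  # per-element (text, label) specs

for text, label in train_set.take(1):  # or look at one actual element
    print(text.shape, label.shape)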

Edit: here is the data pre-processing.
Splitting and extracting:


def labeler(example, index):  #function to label items 
  return example, tf.cast(index, tf.int64)

train_set_1 = tf.data.TextLineDataset( #get data 
    "comments1.txt",
    compression_type=None,
    buffer_size=None,
    num_parallel_reads=None,
    name=None
)

#split data into val and training 

val_set_1 = train_set_1.skip(int(1200*8/10)) #1200 is the number of items so 80:20 split
train_set_1 = train_set_1.take(int(1200*8/10))
#label both sets 
labeled_train_1 = train_set_1.map(lambda ex: labeler(ex, 1)) #1 is the label
labeled_val_1 = val_set_1.map(lambda ex: labeler(ex, 1))
print(labeled_train_1)


print(train_set_1)

#repeat for set 2 
train_set_2 = tf.data.TextLineDataset(
    "comments2.txt",
    compression_type=None,
    buffer_size=None,
    num_parallel_reads=None,
    name=None
)
val_set_2 = train_set_2.skip(int(1200*8/10))
train_set_2 = train_set_2.take(int(1200*8/10))
labeled_train_2 = train_set_2.map(lambda ex: labeler(ex, 2))
labeled_val_2 = val_set_2.map(lambda ex: labeler(ex, 2))


Vectorization:

#len(counter) is total number of words, max_length is 250
vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=len(counter),
    output_mode='int',
    output_sequence_length=max_length)
# adapt() rebuilds the vocabulary from scratch on each call, so adapting on
# the concatenation of both sets keeps the words from both files
vectorize_layer.adapt(train_set_1.concatenate(train_set_2))


def vectorize_text(text, label): #this is where I can change the dimensions and vectorize the whole sequence
  text = tf.expand_dims(text, 0)
  text = tf.expand_dims(text, -1)

  return vectorize_layer(text), label

#actually vectorizing and combining all the text
train_1= labeled_train_1.map(vectorize_text)
train_2 = labeled_train_2.map(vectorize_text)
train_set = train_1.concatenate(train_2)
val_1 = labeled_val_1.map(vectorize_text)
val_2 = labeled_val_2.map(vectorize_text)
val_set = val_1.concatenate(val_2)
print(val_2)
print(list(val_2))

Next, it is just fed into the model.
Edit 2:
I found a notebook doing something similar to my project, and it uses an embedding layer in the neural network, so I think an embedding layer might help. I have experimented with it, and with different configurations of expand_dims, but still have no solution; the layer might be useful, though.

This is what I am currently messing about with:

model.add(keras.layers.Embedding(len(counter),250,input_length=1))
model.add(keras.layers.LSTM(128, activation="relu", return_sequences=True))
model.add(keras.layers.LSTM(128, activation="relu", return_sequences=True))
model.add(keras.layers.LSTM(32, activation="relu", return_sequences=False))
model.add(keras.layers.Dense(1,activation="relu"))

but I have also tried len(counter), 1, input_length=250.
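For comparison, a common pattern (a sketch, not my final code; the embedding width of 64 is an arbitrary choice) feeds the 250-long integer sequences straight into the Embedding layer, so the LSTMs see (batch, 250, 64):

model = keras.Sequential()
model.add(keras.layers.Embedding(len(counter), 64, input_length=250))
model.add(keras.layers.LSTM(128, return_sequences=True))
model.add(keras.layers.LSTM(32))
model.add(keras.layers.Dense(1, activation="sigmoid"))  # sigmoid pairs with binary_crossentropy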
Edit 3: I have managed to change the dimensionality to (250, 1) instead of (1, 250), but I get an error message that for the fit loop the input shape is (None, 1, 1). The problem might be that the input is both the tokenized words, a tensor of size (250, 1), and the answer, i.e. dataset 1 or dataset 2, which is another tensor, leading to a tensor containing two tensors, which might give a size of (None, 1, 1).
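For what it's worth, when a tf.data.Dataset is passed to model.fit, batching has to be done on the dataset itself (the batch_size argument doesn't apply to dataset inputs), and each dataset element is treated as one batch. A sketch of batching explicitly (assuming each element is a single (250, 1) example with a scalar label):

train_set = train_set.batch(32)  # elements become (32, 250, 1) / (32,) pairs
model.fit(train_set, epochs=1)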

CodePudding user response:

You just need to expand your dims, though you might also be missing a step; I'll talk about that later. The fix should be:

model.fit(tf.expand_dims(train_set, 1), batch_size=32, epochs=1)

Now, about the dims:
An RNN expects a single element to be in the shape (None, X), with X a positive integer.
The first dimension, None, represents the length of your phrase/sequence; since it may vary, None is used to avoid having to fix it manually.
The second dimension, X, represents the "features" of an element in your sequence. Take weather forecasting, for example: those features are the wind speed, the humidity, and so on.

Having said this, your sequences should be encoded as (250, 1), since you have 250 words/elements in the sequence and each word has 1 feature (the integer corresponding to it).

Given this, in my opinion, you should use the following:

model = keras.Sequential()
model.add(keras.layers.LSTM(128, input_shape=(250,1), activation="relu", return_sequences=True))
model.add(keras.layers.LSTM(128, activation="relu", return_sequences=True))
model.add(keras.layers.LSTM(32, activation="relu", return_sequences=False))
model.add(keras.layers.Dense(1,activation="relu"))

model.compile(optimizer=tf.keras.optimizers.Adam(0.01), loss='binary_crossentropy', metrics=['accuracy','Precision','Recall'])
model.fit(tf.expand_dims(train_set, -1), batch_size=32, epochs=1)
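Note that if train_set is still a tf.data.Dataset rather than a tensor, tf.expand_dims can't be applied to it directly; a map-based equivalent might look like this (a sketch, assuming each element is a (1, 250) tensor with a scalar label):

# reshape each (1, 250) element into (250, 1): 250 timesteps, 1 feature each
train_set = train_set.map(lambda text, label: (tf.reshape(text, (250, 1)), label))
model.fit(train_set.batch(32), epochs=1)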

You can see this written in the docs on this page:

Call arguments
inputs: A 3D tensor with shape [batch, timesteps, feature].
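A quick sanity check of that contract (a sketch):

import tensorflow as tf

# a [batch=2, timesteps=250, feature=1] tensor is accepted by an LSTM layer
dummy = tf.zeros((2, 250, 1))
print(tf.keras.layers.LSTM(32)(dummy).shape)  # (2, 32)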

CodePudding user response:

I found the solution. The problem was a bug in the way the vectorization layer works, which would cause it to sometimes return an empty array instead of a padded one. Because of this, I had to convert the dataset to arrays using

import tensorflow_datasets as tfds

def dataset_to_numpy(ds):
    """Convert a TensorFlow dataset to lists of numpy arrays."""
    texts = []
    labels = []

    # iterate over the dataset, collecting the texts and labels
    for text, label in tfds.as_numpy(ds):
        texts.append(text)
        labels.append(label)

    # print the shapes of the first few items as a sanity check
    for i, txt in enumerate(texts):
        if i < 3:
            print(txt.shape, labels[i])

    return texts, labels
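Applied to the two sets, that might look like this (a sketch; wrapped in list() so the elements can be reassigned below):

train_set = list(dataset_to_numpy(train_set))
val_set = list(dataset_to_numpy(val_set))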

I then used this function

to_del = []
for i in range(len(train_set[0])):
    if train_set[0][i].shape != (250, 1):  # flag items the vectorizer left unpadded
        print(i)
        to_del.append(i)

which collected the indices of the items to delete.
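The deletion itself isn't shown; a minimal sketch (iterating in reverse so earlier indices stay valid):

# remove the malformed items from both the texts and the labels
for i in reversed(to_del):
    del train_set[0][i]
    del train_set[1][i]

With those items deleted, I ran this code: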


# convert the lists to numpy arrays
train_set[0] = list(train_set[0])
train_set[1] = list(train_set[1])
train_set[0] = np.array([np.array(val) for val in train_set[0]])
train_set[1] = np.array([np.array(val) for val in train_set[1]])
# give the targets an explicit trailing dimension: (n,) -> (n, 1)
train_set[1] = np.expand_dims(train_set[1], -1)

which turned the lists into numpy arrays and also expanded the dimensions of the targets. Finally, I split train_set into x_train and y_train and fed them into the network, where it started to train.
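A minimal sketch of that last step (assuming train_set is the [texts, labels] pair built above):

x_train = train_set[0]  # shape (num_samples, 250, 1)
y_train = train_set[1]  # shape (num_samples, 1)
model.fit(x_train, y_train, batch_size=32, epochs=1)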
