Correct keras LSTM input shape after text-embedding

I'm trying to understand the keras LSTM layer a bit better with regard to timesteps, but I'm still struggling.

I want to create a model that is able to compare 2 inputs (siamese network). So my input consists of two preprocessed texts. The preprocessing is done as follows:

max_len = 64
data['cleaned_text_1'] = assets.apply(lambda x: clean_string(data[]), axis=1)
data['text_1_seq'] = t.texts_to_sequences(data['cleaned_text_1'].astype(str).values)
data['text_1_seq_pad'] = [list(x) for x in pad_sequences(data['text_1_seq'], maxlen=max_len, padding='post')]

The same is done for the second text input. t is an instance of keras.preprocessing.text.Tokenizer.

I defined the model with:

common_embed = Embedding(
    name="synopsis_embedd",
    input_dim=len(t.word_index) + 1,
    output_dim=300,
    input_length=len(data['text_1_seq_pad'].tolist()[0]),
    trainable=True
)

lstm_layer = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(32, dropout=0.2, recurrent_dropout=0.2)
)

input1 = tf.keras.Input(shape=(len(data['text_1_seq_pad'].tolist()[0]),))
e1 = common_embed(input1)
x1 = lstm_layer(e1)

input2 = tf.keras.Input(shape=(len(data['text_1_seq_pad'].tolist()[0]),))
e2 = common_embed(input2)
x2 = lstm_layer(e2)

merged = tf.keras.layers.Lambda(
    function=l1_distance, output_shape=l1_dist_output_shape, name='L1_distance'
)([x1, x2])

conc = Concatenate(axis=-1)([merged, x1, x2])

x = Dropout(0.01)(conc)
preds = tf.keras.layers.Dense(1, activation='sigmoid')(x)
model = tf.keras.Model(inputs=[input1, input2], outputs=preds)

That seems to work if I feed the NumPy data to the fit method:

model.fit(
    x = [np.array(data['text_1_seq_pad'].tolist()), np.array(data['text_2_seq_pad'].tolist())],
    y = y_train.values.reshape(-1,1), 
    epochs=epochs,
    batch_size=batch_size,
    validation_data=([np.array(val['text_1_seq_pad'].tolist()), np.array(val['text_2_seq_pad'].tolist())], y_val.values.reshape(-1,1)),
)

What I'm trying to understand at the moment is what the LSTM layer's input shape is in my case, in terms of:

samples
time_steps
features

Is it correct that the input_shape for the LSTM layer would be input_shape=(300,1) because I set the embedding output dim to 300 and I have only 1 input feature per LSTM?

And do I need to reshape the embedding output or can I just set

lstm_layer = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(32, input_shape=(300,1), dropout=0.2, recurrent_dropout=0.2)
)

from the embedding output?

An example notebook can be found on GitHub or as a Colab.

CodePudding user response：

In general, an LSTM layer needs 3D input shaped this way: (batch_size, length of the input sequence, number of features). The batch dimension is handled by Keras, so you can just consider that a single input needs to have the shape (length of sequence, number of features per item).

In your case, the output dim of your embedding layer is 300, so your LSTM sees 300 features per token.

Then, using an LSTM on batches of sentences requires a constant number of tokens per sentence. Every sequence in a batch must have the same length, so you cannot pass it a text with 12 tokens followed by another one with 68 tokens. Instead, you need to fix a limit and pad the sequence if needed. So, if your sentence is 20 tokens long and your limit is 50, you need to pad the sequence (add at the end) with 30 “neutral” tokens (often zeros).
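The padding step can be sketched in plain Python. Here pad_post is a hypothetical helper written only for illustration; in practice pad_sequences(..., padding='post') from the question does this job. One assumption to note: this sketch cuts over-long sequences at the end, whereas pad_sequences by default truncates from the front.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad each token-id list at the end (or cut it) to exactly maxlen items.

    Hypothetical helper for illustration only, not part of Keras.
    """
    padded = []
    for seq in sequences:
        seq = list(seq)[:maxlen]                    # cut sequences that are too long
        seq = seq + [value] * (maxlen - len(seq))   # fill the rest with the neutral token
        padded.append(seq)
    return padded


short = [5, 8, 2]               # a 3-token sentence
long = [1, 2, 3, 4, 5, 6, 7]    # a 7-token sentence
batch = pad_post([short, long], maxlen=5)
print(batch)  # [[5, 8, 2, 0, 0], [1, 2, 3, 4, 5]]
```

After this step every row has the same number of tokens, so the rows can be stacked into a single 2D array of shape (number of texts, maxlen).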

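In the end, your LSTM input shape must be (number of tokens per text, dimension of your embedding output) -> (50, 300) in my example. With max_len = 64 from your code, that is (64, 300), not (300, 1). The Embedding layer already produces this shape, so you need neither a reshape nor an input_shape argument on the LSTM.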
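A quick way to check these shapes yourself is to push a dummy batch through the two layers. This is only a minimal sketch: vocab_size = 1000 is an assumed stand-in for len(t.word_index) + 1, the batch of 2 all-zero sequences is made up, and the dropout arguments from the question are left out because they do not affect shapes.

```python
import numpy as np
import tensorflow as tf

max_len = 64        # padded sequence length, as in the question
vocab_size = 1000   # assumption: stands in for len(t.word_index) + 1
embed_dim = 300     # embedding output_dim, as in the question

embedding = tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)
bi_lstm = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32))

# Dummy batch: 2 padded texts, each with max_len token ids.
token_ids = np.zeros((2, max_len), dtype="int32")

embedded = embedding(token_ids)   # (samples, time_steps, features)
encoded = bi_lstm(embedded)       # one vector per text, forward + backward states

embedded_shape = embedded.numpy().shape
encoded_shape = encoded.numpy().shape
print(embedded_shape)  # (2, 64, 300)
print(encoded_shape)   # (2, 64)
```

The LSTM input is therefore (samples, 64 time steps, 300 features). The second 64 is a coincidence of the numbers chosen: it is 2 x 32 units from the bidirectional wrapper, not the sequence length.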

To learn more about it, I suggest you take a look at this (in your case, you can replace time_steps with number_of_tokens):

https://shiva-verma.medium.com/understanding-input-and-output-shape-in-lstm-keras-c501ee95c65e
