I'm doing sentiment analysis of Amazon reviews with an RNN (LSTM). df2['Text'] contains the customer review texts, and df2['label'] is a binary integer label, 0 or 1.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

tokenizer = Tokenizer(num_words=5000, split=' ')
tokenizer.fit_on_texts(df2['Text'].values)
encoded_docs = tokenizer.texts_to_sequences(df2['Text'].values)
X = pad_sequences(encoded_docs, maxlen=1000)
X.shape  # (3872, 1000)
y = df2['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
This is my model:
import tensorflow as tf
from tensorflow.keras.layers import Embedding, LSTM

model = tf.keras.Sequential()
# input_dim must cover the tokenizer vocabulary (num_words=5000), not 1000,
# otherwise word indices >= 1000 fall outside the embedding table
model.add(Embedding(5000, 64, input_length=X.shape[1]))
model.add(LSTM(176, dropout=0.4, recurrent_dropout=0.4))
model.add(tf.keras.layers.Dropout(0.3))
model.add(tf.keras.layers.Dense(32, activation='relu'))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))
print(model.summary())

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
              loss=tf.keras.losses.BinaryCrossentropy(),
              metrics=['accuracy'])

batch_size = 128
history = model.fit(X_train, y_train, epochs=13, batch_size=batch_size,
                    validation_data=(X_test, y_test))
The validation accuracy for the last epoch is around 0.86.
And then I tried to predict the result of a text:
def anal_sent(my_text, my_model, my_tokenizer):
    encoded_text = my_tokenizer.texts_to_sequences(my_text)
    X = pad_sequences(encoded_text, maxlen=1000)
    return my_model.predict(X)
ex_review = "I bought it for my son and he says he likes it."
print(anal_sent(ex_review, model, tokenizer)) # this tokenizer is what I used for training dataset.
But the output is an array with many rows, like [[0.73], [0.68], ...], instead of a single 0 or 1.
Is there anything wrong? What's the correct way to make a prediction?
CodePudding user response:
texts_to_sequences expects a list of texts. If you pass it a single string, it iterates over the string character by character and treats each character as a separate text, which is why you get one prediction per character. Wrap the review in a list:

ex_review = ["I bought it for my son and he says he likes it."]
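Note also that the sigmoid output is a probability in [0, 1], not a hard class, so even with the list fix you still need to threshold it to get 0 or 1 (0.5 is the conventional cutoff, but that's a choice, not something the model dictates). A minimal NumPy sketch of that step, using simulated predictions rather than the trained model:

```python
import numpy as np

def probs_to_labels(probs, threshold=0.5):
    """Convert model.predict output of shape (n, 1) into hard 0/1 labels."""
    probs = np.asarray(probs).reshape(-1)      # flatten (n, 1) -> (n,)
    return (probs >= threshold).astype(int)

# simulated model.predict output for two reviews
print(probs_to_labels([[0.73], [0.31]]))       # [1 0]
```

With ex_review wrapped in a list, anal_sent(ex_review, model, tokenizer) returns a (1, 1) array, and probs_to_labels turns it into a single 0 or 1.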