I have the following preprocessing for a tensorflow neural-network:
import csv
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import tensorflow as tf
from tensorflow.keras.layers import Input,Dense,LSTM,Flatten,GlobalAveragePooling1D,Embedding,Dropout
!wget --no-check-certificate \
https://storage.googleapis.com/laurencemoroney-blog.appspot.com/bbc-text.csv \
-O /tmp/bbc-text.csv
# Stopwords list from https://github.com/Yoast/YoastSEO.js/blob/develop/src/config/stopwords.js
# Convert it to a Python list and paste it here
stopwords = ["a", "about", "above", "after", "again", "against", "all", "am", "an", "and", "any", "are", "as", "at",
"be", "because", "been", "before", "being", "below", "between", "both", "but", "by", "could", "did", "do",
"does", "doing", "down", "during", "each", "few", "for", "from", "further", "had", "has", "have", "having",
"he", "he'd", "he'll", "he's", "her", "here", "here's", "hers", "herself", "him", "himself", "his", "how",
"how's", "i", "i'd", "i'll", "i'm", "i've", "if", "in", "into", "is", "it", "it's", "its", "itself",
"let's", "me", "more", "most", "my", "myself", "nor", "of", "on", "once", "only", "or", "other", "ought",
"our", "ours", "ourselves", "out", "over", "own", "same", "she", "she'd", "she'll", "she's", "should",
"so", "some", "such", "than", "that", "that's", "the", "their", "theirs", "them", "themselves", "then",
"there", "there's", "these", "they", "they'd", "they'll", "they're", "they've", "this", "those", "through",
"to", "too", "under", "until", "up", "very", "was", "we", "we'd", "we'll", "we're", "we've", "were",
"what", "what's", "when", "when's", "where", "where's", "which", "while", "who", "who's", "whom", "why",
"why's", "with", "would", "you", "you'd", "you'll", "you're", "you've", "your", "yours", "yourself",
"yourselves"]
#----------------------------------- Read from CSV and remove the stopwords
sentences = []
labels = []
with open("/tmp/bbc-text.csv", 'r') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    next(reader)  # skip the header row
    for row in reader:
        labels.append(row[0])
        sentence = row[1]
        for word in stopwords:
            token = " " + word + " "
            sentence = sentence.replace(token, " ")
            sentence = sentence.replace("  ", " ")  # collapse the double spaces left by the removal
        sentences.append(sentence)
#---------------------------------- Tokenize sentences
tokenizer = Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences, padding = 'post')
#--------------------------------- Tokenize labels
label_tokenizer = Tokenizer()
label_tokenizer.fit_on_texts(labels)
# label_word_index = label_tokenizer.word_index
label_seq = label_tokenizer.texts_to_sequences(labels)
and finally, here is the neural network that is trained on the prepared data:
train_sentence = tf.convert_to_tensor(padded,tf.int32)
train_label = tf.convert_to_tensor(label_seq,tf.int32)
input = Input(shape=(2441,))
x = Embedding(input_dim=10000,output_dim=128)(input)
x = LSTM(64,return_sequences=True)(x)
x = LSTM(64,return_sequences=True)(x)
x = LSTM(64,return_sequences=True)(x)
x = Dropout(0.2)(x)
x = LSTM(64)(x)
x = Flatten()(x)
output = Dense(5, activation='softmax')(x)
model = tf.keras.models.Model(input,output)
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(x=train_sentence,y=train_label,epochs=10)
But it fails with the following error:
InvalidArgumentError: Graph execution error:
CodePudding user response:
The input_dim of the Embedding layer has to correspond to the size of your data's vocabulary + 1 (Tokenizer word indices start at 1, and index 0 is reserved for padding). Also, your labels should begin from zero and not from one when using the sparse_categorical_crossentropy loss function. Here is a working example based on your code and data, followed by a short sanity check:
# ...
# ...
train_sentence = tf.convert_to_tensor(padded,tf.int32)
train_label = tf.convert_to_tensor(label_seq,tf.int32)
train_label = train_label - 1  # shift labels from 1..5 to 0..4 for sparse_categorical_crossentropy
input = Input(shape=(2441,))
x = Embedding(input_dim=len(tokenizer.word_index) + 1,output_dim=128)(x := input) if False else Embedding(input_dim=len(tokenizer.word_index) + 1,output_dim=128)(input)  # +1 for the reserved padding index 0
x = LSTM(64,return_sequences=True)(x)
x = LSTM(64,return_sequences=True)(x)
x = LSTM(64,return_sequences=True)(x)
x = Dropout(0.2)(x)
x = LSTM(64)(x)
x = Flatten()(x)
output = Dense(5, activation='softmax')(x)
model = tf.keras.models.Model(input,output)
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(x=train_sentence,y=train_label,epochs=10)
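As a quick sanity check, here is a minimal sketch (using the tokenizer, padded, and label_seq variables defined above) that verifies both constraints before calling fit:
import numpy as np
vocab_size = len(tokenizer.word_index) + 1   # +1 because index 0 is reserved for padding
print(vocab_size, padded.shape)              # padded.shape[1] should match Input(shape=(2441,))
assert padded.max() < vocab_size             # every token id must be a valid Embedding index
labels_zero_based = np.array(label_seq) - 1  # shift labels from 1..5 to 0..4
assert labels_zero_based.min() >= 0 and labels_zero_based.max() < 5  # range required by Dense(5) with sparse_categorical_crossentropy
If either assertion fails, training typically aborts with the same kind of InvalidArgumentError, because the Embedding lookup or the loss receives an out-of-range index.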