I have a model that I trained with
common_embed = Embedding(
    name="synopsis_embedd",
    input_dim=len(t.word_index) + 1,
    output_dim=len(embeddings_index['no']),
    weights=[embedding_matrix],
    input_length=len(X_train['asset_text_seq_pad'].tolist()[0]),
    trainable=True
)
lstm_1 = common_embed(input_1)
common_lstm = LSTM(64, input_shape=(100, 2))
...
For the embedding I use GloVe as a pre-trained embedding dictionary. I first build the tokenizer and the padded text sequences with:
t = Tokenizer()
t.fit_on_texts(all_text)
text_seq = pad_sequences(t.texts_to_sequences(data['example_texts'].astype(str).values))
and then I'm calculating the embedding matrix with:
embeddings_index = {}
for line in new_byte_string.decode('utf-8').split('\n'):
    if line:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Loaded %s word vectors.' % len(embeddings_index))

not_present_list = []
vocab_size = len(t.word_index) + 1
embedding_matrix = np.zeros((vocab_size, len(embeddings_index['no'])))
for word, i in t.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
    else:
        not_present_list.append(word)
        embedding_matrix[i] = np.zeros(300)
Now I'm using a new dataset for prediction, which leads to an error:
Node: 'model/synopsis_embedd/embedding_lookup' indices[38666,63] = 136482 is not in [0, 129872) [[{{node model/synopsis_embedd/embedding_lookup}}]] [Op:__inference_predict_function_12452]
For prediction I run all of these preprocessing steps again on the new data. Is that wrong, and do I have to reuse the tokenizer from training? Or why do indices appear during prediction that do not exist in the embedding?
CodePudding user response:
You are probably getting this error because you are not using the same tokenizer and embedding_matrix during inference. Here is an example:
import tensorflow as tf
vocab_size = 50
embedding_layer = tf.keras.layers.Embedding(vocab_size, 64, input_length=10)
sequence1 = tf.constant([[1, 2, 5, 10, 32]])
embedding_layer(sequence1) # This works
sequence2 = tf.constant([[51, 2, 5, 10, 32]])
embedding_layer(sequence2) # This throws an error because 51 is larger than the vocab_size=50
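So the fix is to fit the tokenizer once on the training texts, persist it, and only transform the new data with it at prediction time. Below is a minimal sketch of what that could look like; the names model, new_data, and max_len (the padding length used during training) are illustrative assumptions, not part of your code:

import pickle
from tensorflow.keras.preprocessing.sequence import pad_sequences

# At training time: persist the fitted tokenizer.
with open('tokenizer.pickle', 'wb') as f:
    pickle.dump(t, f)

# At inference time: load the SAME tokenizer back.
# Do NOT call fit_on_texts() on the new data, otherwise the word-to-index
# mapping (and therefore the valid index range of the Embedding layer) changes.
with open('tokenizer.pickle', 'rb') as f:
    t = pickle.load(f)

new_seq = t.texts_to_sequences(new_data['example_texts'].astype(str).values)
# Pad to the same length that was used during training (the input_length of the Embedding layer).
new_seq_pad = pad_sequences(new_seq, maxlen=max_len)

predictions = model.predict(new_seq_pad)

With the original tokenizer, words that were not seen during training are simply dropped by texts_to_sequences (or mapped to the oov_token if one was set), so no index can exceed the vocabulary size the Embedding layer was built with.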