I have a model that I trained with
common_embed = Embedding(
    name="synopsis_embedd",
    input_dim=len(t.word_index) + 1,
    output_dim=len(embeddings_index['no']),
    weights=[embedding_matrix],
    input_length=len(X_train['asset_text_seq_pad'].tolist()[0]),
    trainable=True
)
lstm_1 = common_embed(input_1)
common_lstm = LSTM(64, input_shape=(100, 2))
...
For the embedding I use GloVe as a pre-trained embedding dictionary. I first build the tokenizer and the padded text sequences with:
t = Tokenizer()
t.fit_on_texts(all_text)
text_seq = pad_sequences(t.texts_to_sequences(data['example_texts'].astype(str).values))
and then I'm calculating the embedding matrix with:
embeddings_index = {}
for line in new_byte_string.decode('utf-8').split('\n'):
    if line:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Loaded %s word vectors.' % len(embeddings_index))

not_present_list = []
vocab_size = len(t.word_index) + 1
embedding_matrix = np.zeros((vocab_size, len(embeddings_index['no'])))
for word, i in t.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
    else:
        not_present_list.append(word)
        embedding_matrix[i] = np.zeros(300)
Now I'm using a new dataset for prediction, which leads to an error:
Node: 'model/synopsis_embedd/embedding_lookup' indices[38666,63] = 136482 is not in [0, 129872) [[{{node model/synopsis_embedd/embedding_lookup}}]] [Op:__inference_predict_function_12452]
For prediction I run all of these preprocessing steps again on the new data. Is that wrong, and do I have to reuse the tokenizer from training? Or why do indices appear during prediction that do not exist in the embedding?
CodePudding user response:
You are probably getting this error because you are not using the same tokenizer and embedding_matrix during inference. Here is an example:
import tensorflow as tf
vocab_size = 50
embedding_layer = tf.keras.layers.Embedding(vocab_size, 64, input_length=10)
sequence1 = tf.constant([[1, 2, 5, 10, 32]])
embedding_layer(sequence1) # This works
sequence2 = tf.constant([[51, 2, 5, 10, 32]])
embedding_layer(sequence2) # This throws an error because 51 is larger than the vocab_size=50
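So the fix is to fit the tokenizer once on the training texts, persist it, and only transform the new data with it at prediction time. Below is a minimal sketch of what that could look like; the names model, new_data, and max_len (the padding length used during training) are illustrative assumptions, not part of your code:

import pickle
from tensorflow.keras.preprocessing.sequence import pad_sequences

# At training time: persist the fitted tokenizer.
with open('tokenizer.pickle', 'wb') as f:
    pickle.dump(t, f)

# At inference time: load the SAME tokenizer back.
# Do NOT call fit_on_texts() on the new data, otherwise the word-to-index
# mapping (and therefore the valid index range of the Embedding layer) changes.
with open('tokenizer.pickle', 'rb') as f:
    t = pickle.load(f)

new_seq = t.texts_to_sequences(new_data['example_texts'].astype(str).values)
# Pad to the same length that was used during training (the input_length of the Embedding layer).
new_seq_pad = pad_sequences(new_seq, maxlen=max_len)

predictions = model.predict(new_seq_pad)

With the original tokenizer, words that were not seen during training are simply dropped by texts_to_sequences (or mapped to the oov_token if one was set), so no index can exceed the vocabulary size the Embedding layer was built with.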