Prediction with keras embedding leads to indices not in list


I have a model that I trained with:

common_embed = Embedding(
    name="synopsis_embedd",
    input_dim=len(t.word_index) + 1,
    output_dim=len(embeddings_index['no']),
    weights=[embedding_matrix],
    input_length=len(X_train['asset_text_seq_pad'].tolist()[0]),
    trainable=True
)

lstm_1 = common_embed(input_1)
common_lstm = LSTM(64, input_shape=(100,2))
...

For the embedding I use GloVe as a pre-trained embedding dictionary. I first build the tokenizer and the padded text sequences with:

t = Tokenizer()
t.fit_on_texts(all_text)

text_seq = pad_sequences(t.texts_to_sequences(data['example_texts'].astype(str).values))

and then I calculate the embedding matrix with:

embeddings_index = {}
for line in new_byte_string.decode('utf-8').split('\n'):
  if line:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs


not_present_list = []
vocab_size = len(t.word_index) + 1
print('Loaded %s word vectors.' % len(embeddings_index))
embedding_matrix = np.zeros((vocab_size, len(embeddings_index['no'])))
for word, i in t.word_index.items():
    # look the word up fresh each iteration so a miss doesn't reuse the previous vector
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
    else:
        # word has no GloVe vector; its row stays all zeros
        not_present_list.append(word)

Now I'm using a new dataset for prediction, and this leads to an error:

Node: 'model/synopsis_embedd/embedding_lookup' indices[38666,63] = 136482 is not in [0, 129872) [[{{node model/synopsis_embedd/embedding_lookup}}]] [Op:__inference_predict_function_12452]

For prediction I run all of these preprocessing steps again. Is that wrong, and do I have to reuse the tokenizer from training? Or why do indices show up during prediction that don't exist in the embedding?

CodePudding user response:

You are probably getting this error because you are not using the same tokenizer and embedding_matrix during inference. Here is an example:

import tensorflow as tf

vocab_size = 50
embedding_layer = tf.keras.layers.Embedding(vocab_size, 64, input_length=10)

sequence1 = tf.constant([[1, 2, 5, 10, 32]])
embedding_layer(sequence1) # This works

sequence2 = tf.constant([[51, 2, 5, 10, 32]])
embedding_layer(sequence2) # This throws an error because the index 51 is outside the valid range [0, 50)
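
So for prediction you have to reuse the exact tokenizer that was fitted on the training texts (and the embedding matrix built from it) instead of fitting a new one on the new data; otherwise new words receive indices outside [0, vocab_size). A minimal sketch of how that could look, assuming the fitted tokenizer t from training is still available (the file name tokenizer.pkl and the names new_data, max_len_used_in_training and model are placeholders, not from the original post):

import pickle
from tensorflow.keras.preprocessing.sequence import pad_sequences

# after training: persist the fitted tokenizer alongside the model
with open('tokenizer.pkl', 'wb') as f:
    pickle.dump(t, f)

# at inference: load the same tokenizer instead of fitting a new one
with open('tokenizer.pkl', 'rb') as f:
    t = pickle.load(f)

# map the new texts with the training vocabulary; words the tokenizer
# has never seen are dropped (or mapped to the oov_token if one was set)
new_seq = t.texts_to_sequences(new_data['example_texts'].astype(str).values)

# pad to the same sequence length that was used during training
new_seq_pad = pad_sequences(new_seq, maxlen=max_len_used_in_training)

predictions = model.predict(new_seq_pad)

If unknown words should get their own index instead of being dropped, create the tokenizer with Tokenizer(oov_token='<OOV>') before fitting; that index is part of the vocabulary, so it cannot trigger the embedding_lookup error.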