TensorFlow's seq2seq: tensorflow.python.framework.errors_impl.InvalidArgumentError

I am closely following the seq2seq translation tutorial at https://www.tensorflow.org/addons/tutorials/networks_seq2seq_nmt#define_the_optimizer_and_the_loss_function, but testing on different data. I get an error when calling the Encoder, which is defined as

class Encoder(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, enc_units, batch_sz):
    super(Encoder, self).__init__()
    self.batch_sz = batch_sz
    self.enc_units = enc_units
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)

    ##-------- LSTM layer in Encoder ------- ##
    self.lstm_layer = tf.keras.layers.LSTM(self.enc_units,
                                   return_sequences=True,
                                   return_state=True,
                                   recurrent_initializer='glorot_uniform')

  def call(self, x, hidden):
    x = self.embedding(x)
    output, h, c = self.lstm_layer(x, initial_state = hidden)
    return output, h, c

  def initialize_hidden_state(self):
    return [tf.zeros((self.batch_sz, self.enc_units)), tf.zeros((self.batch_sz, self.enc_units))]

It fails when I test it here

# Test Encoder Stack
encoder = Encoder(vocab_size, embedding_dim, units, BATCH_SIZE)

# sample input
sample_hidden = encoder.initialize_hidden_state()
sample_output, sample_h, sample_c = encoder(example_input_batch, sample_hidden)

The error is the following

Traceback (most recent call last):
  File "C:/Users/Seq2seq/Seq2seq-V3.py", line 132, in <module>
    sample_output, sample_h, sample_c = encoder(example_input_batch, sample_hidden)
  File "C:\Users\AppData\Local\Programs\Python\Python39\lib\site-packages\keras\utils\traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "C:/Users/Seq2seq/Seq2seq-V3.py", line 119, in call
    x = self.embedding(x)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Exception encountered when calling layer "embedding" (type Embedding).

indices[12,148] = 106 is not in [0, 106) [Op:ResourceGather]

Call arguments received:
  • inputs=tf.Tensor(shape=(64, 200), dtype=int64)

TF 2.0

This might be a problem in TF Addons. Does anyone have experience with that?

EDIT

The tutorial tokenizes at the word level, whereas I encode the text at the character level; 106 is my vocab_size (the number of distinct characters).

CodePudding user response：

This error occurs when a sequence contains integer values outside the range allowed by the defined vocabulary size. The Embedding layer here has a vocabulary size of 106, so valid indices run from 0 to 105. You can reproduce the error with the following example, where I pass a random sequence with values between 0 and 200 to force it:

import tensorflow as tf

class Encoder(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, enc_units, batch_sz):
    super(Encoder, self).__init__()
    self.batch_sz = batch_sz
    self.enc_units = enc_units
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)

    ##-------- LSTM layer in Encoder ------- ##
    self.lstm_layer = tf.keras.layers.LSTM(self.enc_units,
                                   return_sequences=True,
                                   return_state=True,
                                   recurrent_initializer='glorot_uniform')

  def call(self, x, hidden):
    x = self.embedding(x)
    output, h, c = self.lstm_layer(x, initial_state = hidden)
    return output, h, c

  def initialize_hidden_state(self):
    return [tf.zeros((self.batch_sz, self.enc_units)), tf.zeros((self.batch_sz, self.enc_units))]

units = 32
BATCH_SIZE = 10
embedding_dim = 20
vocab_size = 106
encoder = Encoder(vocab_size, embedding_dim, units, BATCH_SIZE)
sample_hidden = encoder.initialize_hidden_state()

example_input_batch = tf.random.uniform((10, 15), maxval=201, dtype=tf.int32)
sample_output, sample_h, sample_c = encoder(example_input_batch, sample_hidden)
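
To locate the offending values before they reach the Embedding layer, a small range check can help. This is only a sketch: the helper name find_out_of_range_ids is hypothetical (it is not part of TensorFlow or the tutorial), and it assumes a 2-D batch of integer ids that NumPy can convert to an array.

```python
import numpy as np

def find_out_of_range_ids(batch, vocab_size):
    """Return (row, col, value) for every id outside [0, vocab_size).

    Hypothetical helper for debugging; assumes `batch` is a 2-D
    array-like of integer token ids.
    """
    ids = np.asarray(batch)
    # An Embedding layer with input_dim=vocab_size accepts 0..vocab_size-1 only.
    bad_mask = (ids < 0) | (ids >= vocab_size)
    return [(int(r), int(c), int(ids[r, c])) for r, c in np.argwhere(bad_mask)]

# Toy batch: the value 106 is invalid when vocab_size is 106.
toy_batch = [[0, 5, 105],
             [106, 3, 2]]
print(find_out_of_range_ids(toy_batch, vocab_size=106))
```

With an eager tensor such as example_input_batch, passing example_input_batch.numpy() should work the same way. An empty result means every index is within range.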

CodePudding user response：

In fact, this line is enough of a hint:

indices[12,148] = 106 is not in [0, 106) [Op:ResourceGather]

I had to make sure my vocabulary size is vocab_size = len(vocab) + 1. The dataset construction now reads

text = open(FILE_PATH, 'rb').read().decode(encoding='utf-8') 
vocab = sorted(set(text))

# [...]

vocab_size = len(vocab) + 1
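
One likely reason the extra 1 is needed is that the character ids start at 1, with 0 left free for padding. That is an assumption, since the post does not show how characters are mapped to ids. Under that assumption, a minimal sketch with a toy string and a hypothetical char2idx mapping looks like this:

```python
# Toy text standing in for the real corpus (assumption for illustration).
text = "hello world"
vocab = sorted(set(text))

# Assumed mapping: ids start at 1 so that 0 stays free for padding.
char2idx = {ch: i + 1 for i, ch in enumerate(vocab)}
encoded = [char2idx[ch] for ch in text]

# The largest id equals len(vocab), so the Embedding layer needs one more row.
vocab_size = len(vocab) + 1
assert max(encoded) == len(vocab)
assert max(encoded) < vocab_size
```

With vocab_size = len(vocab) the largest id would fall exactly one past the valid range, which matches the "106 is not in [0, 106)" message.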