I am closely following the seq2seq translation tutorial at https://www.tensorflow.org/addons/tutorials/networks_seq2seq_nmt#define_the_optimizer_and_the_loss_function while testing it on other data. I get an error when calling the Encoder, which is defined as:
class Encoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, enc_units, batch_sz):
        super(Encoder, self).__init__()
        self.batch_sz = batch_sz
        self.enc_units = enc_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        ##-------- LSTM layer in Encoder ------- ##
        self.lstm_layer = tf.keras.layers.LSTM(self.enc_units,
                                               return_sequences=True,
                                               return_state=True,
                                               recurrent_initializer='glorot_uniform')

    def call(self, x, hidden):
        x = self.embedding(x)
        output, h, c = self.lstm_layer(x, initial_state=hidden)
        return output, h, c

    def initialize_hidden_state(self):
        return [tf.zeros((self.batch_sz, self.enc_units)), tf.zeros((self.batch_sz, self.enc_units))]
It fails when I test it here:
# Test Encoder Stack
encoder = Encoder(vocab_size, embedding_dim, units, BATCH_SIZE)
# sample input
sample_hidden = encoder.initialize_hidden_state()
sample_output, sample_h, sample_c = encoder(example_input_batch, sample_hidden)
The error is the following
Traceback (most recent call last):
  File "C:/Users/Seq2seq/Seq2seq-V3.py", line 132, in <module>
    sample_output, sample_h, sample_c = encoder(example_input_batch, sample_hidden)
  File "C:\Users\AppData\Local\Programs\Python\Python39\lib\site-packages\keras\utils\traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "C:/Users/Seq2seq/Seq2seq-V3.py", line 119, in call
    x = self.embedding(x)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Exception encountered when calling layer "embedding" (type Embedding).
indices[12,148] = 106 is not in [0, 106) [Op:ResourceGather]
Call arguments received:
  • inputs=tf.Tensor(shape=(64, 200), dtype=int64)
I am using TF 2.0.
Could this be a problem in TF Addons? Does anyone have experience with that?
EDIT
The tutorial tokenizes at the word level, whereas I encode the text at the character level; 106 is my vocab_size (the number of distinct characters).
CodePudding user response:
This error occurs when a sequence contains integer values outside the range of the defined vocabulary size. The Embedding layer is a lookup table with vocab_size rows, so with a vocabulary size of 106 the valid indices are 0 to 105. You can reproduce your error with the following example, in which I pass a random sequence with values between 0 and 200 to force the error:
import tensorflow as tf
class Encoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, enc_units, batch_sz):
        super(Encoder, self).__init__()
        self.batch_sz = batch_sz
        self.enc_units = enc_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        ##-------- LSTM layer in Encoder ------- ##
        self.lstm_layer = tf.keras.layers.LSTM(self.enc_units,
                                               return_sequences=True,
                                               return_state=True,
                                               recurrent_initializer='glorot_uniform')

    def call(self, x, hidden):
        x = self.embedding(x)
        output, h, c = self.lstm_layer(x, initial_state=hidden)
        return output, h, c

    def initialize_hidden_state(self):
        return [tf.zeros((self.batch_sz, self.enc_units)), tf.zeros((self.batch_sz, self.enc_units))]

units = 32
BATCH_SIZE = 10
embedding_dim = 20
vocab_size = 106
encoder = Encoder(vocab_size, embedding_dim, units, BATCH_SIZE)
sample_hidden = encoder.initialize_hidden_state()
example_input_batch = tf.random.uniform((10, 15), maxval=201, dtype=tf.int32)
sample_output, sample_h, sample_c = encoder(example_input_batch, sample_hidden)
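Before calling the model, it can help to scan a batch for ids that fall outside [0, vocab_size). The following is only a minimal sketch in plain Python on nested lists; the helper name find_out_of_range and the toy batch are hypothetical and not part of the tutorial. With real tensors, comparing tf.reduce_max(example_input_batch) against vocab_size gives the same information.

```python
def find_out_of_range(batch, vocab_size):
    """Return (row, col, value) for every id outside [0, vocab_size)."""
    return [
        (r, c, token_id)
        for r, row in enumerate(batch)
        for c, token_id in enumerate(row)
        if not 0 <= token_id < vocab_size
    ]


# Toy batch: the id 106 is invalid for an Embedding created with vocab_size=106.
batch = [[1, 5, 105], [0, 106, 3]]
bad = find_out_of_range(batch, vocab_size=106)
print(bad)  # [(1, 1, 106)]
```

An empty result means every id has a row in the embedding table; any entry points at the exact position, in the same way the indices[...] part of the error message does.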
CodePudding user response:
In fact, this line of the error message is enough of a hint:
indices[12,148] = 106 is not in [0, 106) [Op:ResourceGather]
I had to make sure the vocabulary size is vocab_size = len(vocab) + 1, so that the largest index (106) falls inside the range accepted by the Embedding layer. The dataset construction now reads:
text = open(FILE_PATH, 'rb').read().decode(encoding='utf-8')
vocab = sorted(set(text))
# [...]
vocab_size = len(vocab) + 1
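One likely reason the extra 1 is needed: if character ids start at 1 and 0 is kept for padding, the largest id equals len(vocab). That numbering is an assumption here, since the elided part of the dataset code is not shown. A small plain-Python sketch of the arithmetic, where the sample text, char2idx and max_len are made up for illustration:

```python
text = "hello world"  # stand-in for the decoded file contents

vocab = sorted(set(text))
# Assumed convention: ids start at 1 so that 0 stays free for padding.
char2idx = {ch: i for i, ch in enumerate(vocab, start=1)}

encoded = [char2idx[ch] for ch in text]
max_len = 16  # hypothetical padded sequence length
padded = encoded + [0] * (max_len - len(encoded))

vocab_size = len(vocab) + 1  # +1 accounts for the padding id 0

# The largest id equals len(vocab), so it only fits with the +1.
assert max(padded) == len(vocab)
assert max(padded) < vocab_size
```

If the ids instead started at 0 and no padding id were used, len(vocab) alone would be sufficient.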