I create a tf.data dataset from tokenized text: the text is converted to integer sequences and then to NumPy arrays.
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, LSTM, Dropout, Dense, Activation

tokenizer = Tokenizer()
tokenizer.fit_on_texts(bible_text)  # builds the word index
sequences = tokenizer.texts_to_sequences(bible_text)
##-->[[5, 1, 914, 32, 1352, 1, 214, 2, 1, 111],
## [2, 1, 111, 31, 252, 2091, 2, 1874, 2, 547, 31, 38, 1, 196, 3, 1, 899, 2, 1, 298, 3, 32, 878, 38, 1, 196, 3, 1, 266],
## [2, 32, 33, 79, 54, 16, 369, 2, 54, 31, 369], [2, 32, 215, 1, 369, 6, 17, 31, 156, 2, 32, 955, 1, 369, 34, 1, 547], ...]
sequences=pad_sequences(sequences, padding='post')
##-->[[ 5 1 914 32 1352 1 214 2 1 111 0 0 0 0
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 0 0 0 0 0 0]
##...]
word_index=tokenizer.word_index
##for k,v in sorted(word_index.items(), key=operator.itemgetter(1))[:10]:
## print (k,v)
##--> the 1
##and 2
##of 3
##to 4
##in 5
##that 6
##shall 7
##he 8
##lord 9
##his 10
##
##[...]
vocab_size = len(tokenizer.word_index) + 1  # +1 to account for the padding index 0
Then I build the input and target sequences:
input_sequences, target_sequences = sequences[:,:-1], sequences[:,1:]
seq_length=input_sequences.shape[1] ##-->89
num_verses=input_sequences.shape[0]
input_sequences=np.array(input_sequences)
target_sequences=np.array(target_sequences)
and then the dataset:
dataset= tf.data.Dataset.from_tensor_slices((input_sequences, target_sequences))
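As a quick sanity check (my own addition, not part of the original pipeline), the element spec shows one (input, target) pair of integer vectors per verse:
print(dataset.element_spec)
# expect two TensorSpecs, each of shape (seq_length,)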
Nothing seems particularly wrong with this dataset setup. I define the model here
EPOCHS=2
BATCH_SIZE=256
VAL_FRAC=0.2
LSTM_UNITS=1024
DENSE_UNITS=vocab_size
EMBEDDING_DIM=256
BUFFER_SIZE=10000
len_val=int(num_verses*VAL_FRAC)
#build validation dataset
validation_dataset = dataset.take(len_val)
validation_dataset = (
    validation_dataset
    .shuffle(BUFFER_SIZE)
    .padded_batch(BATCH_SIZE, drop_remainder=True)
    .prefetch(tf.data.experimental.AUTOTUNE))
#build training dataset
train_dataset = dataset.skip(len_val)
train_dataset = (
    train_dataset
    .shuffle(BUFFER_SIZE)
    .padded_batch(BATCH_SIZE, drop_remainder=True)
    .prefetch(tf.data.experimental.AUTOTUNE))
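Taking one batch (again a check I added while debugging, not in the original code) confirms how inputs and targets are batched together:
for x_batch, y_batch in train_dataset.take(1):
    print(x_batch.shape, y_batch.shape)
# both should be (BATCH_SIZE, seq_length)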
#build the model: 2 stacked LSTM
print('Build model...')
model = tf.keras.Sequential()
model.add(Embedding(vocab_size, EMBEDDING_DIM))
model.add(LSTM(LSTM_UNITS, return_sequences=True, input_shape=(seq_length, vocab_size)))
model.add(Dropout(0.2))
model.add(LSTM(512, return_sequences=False))
model.add(Dropout(0.2))
model.add(Dense(DENSE_UNITS))
model.add(Activation('softmax'))
loss=tf.losses.SparseCategoricalCrossentropy(from_logits=False)
model.compile(optimizer='adam',
              loss=loss,
              metrics=[tf.keras.metrics.SparseCategoricalAccuracy()])
model.summary()
I get the following error; it is raised in the fit() call:
ValueError: Shape mismatch: The shape of labels (received (16640,)) should equal the shape of logits except for the last dimension (received (256, 3067)).
Any idea what could be wrong?
EDIT
If I change the loss to categorical_crossentropy, I get:
/usr/local/lib/python3.6/dist-packages/keras/backend.py:4839 categorical_crossentropy
target.shape.assert_is_compatible_with(output.shape)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/tensor_shape.py:1161 assert_is_compatible_with
raise ValueError("Shapes %s and %s are incompatible" % (self, other))
ValueError: Shapes (256, 65) and (256, 3067) are incompatible
EDIT
I used the model suggested by AloneTogether, and this fixes the fitting step. But I run into a problem when predicting on new data:
preds = model.predict(x, verbose=0)[0][0]
since predictions do not sum exactly to one
>>> preds
array([1.6435336e-04, 1.4827750e-04, 1.4495676e-04, ..., 8.9204557e-05,
8.9799374e-05, 8.7148059e-05], dtype=float32)
>>> sum(preds)
1.0000000457002898
which seems to be why I then can't sample from this 'distribution'
def sample(a, temperature=1.0):
    # helper function to sample an index from a probability array
    a = np.log(a) / temperature
    a = np.exp(a) / np.sum(np.exp(a))
    return np.argmax(np.random.multinomial(1, a, 1))
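The only thing I have tried so far is an ad-hoc variant that recomputes the distribution in float64 and renormalizes before calling np.random.multinomial (the name sample_renormalized is mine, and I am not sure this is the proper fix):
def sample_renormalized(a, temperature=1.0):
    # same helper as above, but work in float64 so the probabilities
    # passed to np.random.multinomial sum to 1 within tolerance
    a = np.asarray(a, dtype='float64')
    a = np.log(a) / temperature
    a = np.exp(a) / np.sum(np.exp(a))
    return np.argmax(np.random.multinomial(1, a, 1))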
Any clue why this happens, and is there a cleaner workaround?
CodePudding user response:
Your preprocessing steps seem fine. Assuming you want to generate a sequence as your output (your targets are sequences), try adjusting your model as follows:
model = tf.keras.Sequential()
model.add(tf.keras.layers.Embedding(vocab_size, EMBEDDING_DIM))
model.add(tf.keras.layers.LSTM(LSTM_UNITS, return_sequences=True))
model.add(tf.keras.layers.Dropout(0.2))
model.add(tf.keras.layers.LSTM(512, return_sequences=True))
model.add(tf.keras.layers.Dropout(0.2))
model.add(tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(DENSE_UNITS, activation='softmax')))
Note that your last LSTM layer now returns the sequences again. The TimeDistributed layer simply applies a fully connected layer with a softmax activation to each time step i, computing a probability for each word in the vocabulary. The number of units in that fully connected layer equals the vocabulary size, so that every word gets a fair chance of being predicted.
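For completeness, here is a minimal compile-and-fit sketch for this adjusted model, assuming the same train_dataset and validation_dataset built in the question (the targets are integer sequences, so sparse categorical cross-entropy still applies, now evaluated at every time step):
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
              metrics=[tf.keras.metrics.SparseCategoricalAccuracy()])

model.fit(train_dataset,
          validation_data=validation_dataset,
          epochs=EPOCHS)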
To sample from the distribution based on some input, you can do the following:
temperature = 1.0
sample = input_sequences[0] # "You are unsure whether or not to trust him but very thankful that you wore a turtle neck"
sample = tf.expand_dims(sample, axis=0)
predictions = model.predict(sample) / temperature
index_word=tokenizer.index_word
predictions = tf.squeeze(predictions, axis=0)
sampled_indices = tf.random.categorical(predictions, num_samples=1)
word_list = list(np.vectorize(index_word.get)(sampled_indices))
print(sampled_indices)
print(word_list)
'''
tf.Tensor(
[[ 7]
[45]
[52]
[41]
[29]
[21]
[21]
[35]
[27]
[ 6]
[38]
[44]
[25]
[39]
[13]
[19]
[26]], shape=(17, 1), dtype=int64)
[array(['about'], dtype='<U7'), array(['thorns'], dtype='<U7'), array(['would'], dtype='<U7'), array(['is'], dtype='<U7'), array(['but'], dtype='<U7'), array(['by'], dtype='<U7'), array(['by'], dtype='<U7'), array(['all'], dtype='<U7'), array(['to'], dtype='<U7'), array(['she'], dtype='<U7'), array(['wander'], dtype='<U7'), array(['have'], dtype='<U7'), array(['whether'], dtype='<U7'), array(['lost'], dtype='<U7'), array(['are'], dtype='<U7'), array(['your'], dtype='<U7'), array(['or'], dtype='<U7')]
'''
Of course, the model I trained will spit out gibberish, since it was trained on 10 samples for 2 epochs, but hopefully you get the idea. The sampler function (tf.random.categorical) draws from the multinomial distribution produced by the temperature-weighted softmax at each time step. For example, let w be the probability distribution over the vocabulary v at time step 1; the sampler takes w and draws one integer value representing a word with high probability under that multinomial distribution.
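If the end goal is free-running text generation, one possible extension is to sample only the last time step and feed the chosen word back into the model. The sketch below is my own illustration (generate_text, seed_text, and num_words are hypothetical names, not from your code), so treat it as a starting point rather than a tested recipe:
def generate_text(model, tokenizer, seed_text, num_words=20, temperature=1.0):
    # Hypothetical helper: sampling loop that repeatedly takes the
    # distribution of the last time step, draws one word, and appends it.
    index_word = tokenizer.index_word
    tokens = tokenizer.texts_to_sequences([seed_text])[0]
    for _ in range(num_words):
        x = tf.expand_dims(tokens, axis=0)                  # shape (1, current_length)
        probs = model.predict(x, verbose=0)[0, -1]          # distribution at the last time step
        logits = tf.math.log(probs[None, :]) / temperature  # temperature on log-probabilities
        next_id = int(tf.random.categorical(logits, num_samples=1)[0, 0])
        tokens.append(next_id)
    return ' '.join(index_word.get(i, '') for i in tokens)

print(generate_text(model, tokenizer, 'in the beginning', num_words=10))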