I have a Keras tokenizer and I want to add a start-of-sentence token to my sequences, but I could not find anything that shows how to do that.
tokenizer = Tokenizer(split=' ')
tokenizer.fit_on_texts(data)
tokenizer.word_index['<pad>'] = 0
tokenizer.index_word[0] = '<pad>'
text_tokenized = tokenizer.texts_to_sequences(data)
text_corpus_padded = pad_sequences(text_tokenized, padding='post', maxlen=100, dtype='int32')
CodePudding user response:
Depending on your use case (for example, a decoder model), you could add the <sos> and <eos> tokens to each sentence and then tokenize them like this:
import tensorflow as tf
data = ['Hello World', 'Hello New World']
data = ['<sos> ' + x + ' <eos>' for x in data]
tokenizer = tf.keras.preprocessing.text.Tokenizer(split=' ', filters='!"#$%&()*+,-./:;=?@[\\]^_`{|}~\t\n')
tokenizer.fit_on_texts(data)
tokenizer.word_index['<pad>'] = 0
tokenizer.index_word[0] = '<pad>'
text_tokenized = tokenizer.texts_to_sequences(data)
print(text_tokenized)
print(tokenizer.word_index)
[[1, 2, 3, 4], [1, 2, 5, 3, 4]]
{'<sos>': 1, 'hello': 2, 'world': 3, '<eos>': 4, 'new': 5, '<pad>': 0}
Note that I have removed < and > from the filters in the Tokenizer so that you can use these characters in your sentences. Also, check this tutorial.
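After tokenizing, you can pad the sequences with pad_sequences as in the question. As a rough illustration (not the Keras implementation itself), padding='post' truncates each sequence to maxlen and then appends the <pad> index (0) on the right; a minimal pure-Python sketch, using the token ids produced above and a hypothetical helper name pad_post:

```python
def pad_post(sequences, maxlen, value=0):
    # Sketch of pad_sequences(padding='post'): truncate to maxlen,
    # then right-pad with `value` (the <pad> index) up to maxlen.
    return [seq[:maxlen] + [value] * (maxlen - len(seq[:maxlen]))
            for seq in sequences]

# Token ids from the example above: <sos>=1, hello=2, world=3, <eos>=4, new=5
text_tokenized = [[1, 2, 3, 4], [1, 2, 5, 3, 4]]
print(pad_post(text_tokenized, maxlen=6))
# [[1, 2, 3, 4, 0, 0], [1, 2, 5, 3, 4, 0]]
```

Because the padding value 0 is the same index mapped to '<pad>' in the tokenizer, the padded positions decode back to the <pad> token.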