I followed a tutorial about tokenizing sentences using TensorFlow; here's the code I'm trying:
from tensorflow.keras.preprocessing.text import Tokenizer #API for tokenization
t = Tokenizer(num_words=4) # meant to keep only the most important (most frequent) words
listofsentences=['Apples are fruits', 'An orange is a tasty fruit', 'Fruits are tasty!']
t.fit_on_texts(listofsentences) #processes words
print(t.word_index)
print(t.texts_to_sequences(listofsentences)) # converts each sentence to a list of token indices, returns a nested list
The first print statement shows a dictionary as expected:
{'are': 1, 'fruits': 2, 'tasty': 3, 'apples': 4, 'an': 5, 'orange': 6, 'is': 7, 'a': 8, 'fruit': 9}
But the last line outputs nested lists that are missing many of the words:
[[1, 2], [3], [2, 1, 3]]
Please let me know what I'm doing wrong and how to get the expected list:
[[4, 1, 2], [5, 6, 7, 8, 3, 9], [2, 1, 3]]
CodePudding user response:
The num_words=4 argument tells the Tokenizer to keep only the num_words - 1 = 3 most frequent words (indices 1 through 3) when converting texts to sequences; word_index itself always contains every word, which is why the first print looked fine. To allow an unlimited number of tokens, drop the limit:
t = Tokenizer(num_words=None)
Output:
{'are': 1, 'fruits': 2, 'tasty': 3, 'apples': 4, 'an': 5, 'orange': 6, 'is': 7, 'a': 8, 'fruit': 9}
[[4, 1, 2], [5, 6, 7, 8, 3, 9], [2, 1, 3]]
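For reference, here's a minimal self-contained sketch contrasting the two settings; the names sentences and t_limited are just illustrative, and the commented outputs assume the same three sentences from the question:

from tensorflow.keras.preprocessing.text import Tokenizer

sentences = ['Apples are fruits', 'An orange is a tasty fruit', 'Fruits are tasty!']

# No vocabulary limit: every word gets an index and is kept in the sequences.
t = Tokenizer(num_words=None)
t.fit_on_texts(sentences)
print(t.word_index)                  # all 9 words, indexed by frequency
print(t.texts_to_sequences(sentences))  # [[4, 1, 2], [5, 6, 7, 8, 3, 9], [2, 1, 3]]

# With num_words=4, only the 3 most frequent words (indices 1-3) survive
# texts_to_sequences, so words like 'apples' (index 4) are dropped.
t_limited = Tokenizer(num_words=4)
t_limited.fit_on_texts(sentences)
print(t_limited.texts_to_sequences(sentences))  # [[1, 2], [3], [2, 1, 3]]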