Why is the length of the word_index greater than num_words?


I have some code for text preprocessing for deep learning:

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words = 10000)
tokenizer.fit_on_texts(X)
tokenizer.word_index

But when I check the length of tokenizer.word_index, expecting to get 10000, I get 13233. The length of X is 11541 (it is a DataFrame column containing 11541 texts, if that matters). So my question is: which one is the vocabulary size, num_words or the length of word_index? I am confused. Any help is appreciated.

CodePudding user response:

According to the official docs, the argument num_words is,

the maximum number of words to keep, based on word frequency. Only the most common num_words-1 words will be kept.

word_index will hold all the words that are present in the texts. The difference becomes apparent when you use Tokenizer.texts_to_sequences. For instance, consider the following sentences:

import tensorflow as tf

texts = [
    'hello world' , 
    'hello python' , 
    'python' , 
    'hello java' ,
    'hello java' , 
    'hello python'
]
# Word frequencies: hello -> 5, python -> 3, java -> 2, world -> 1
tokenizer = tf.keras.preprocessing.text.Tokenizer( num_words=3 )
tokenizer.fit_on_texts( texts )
print( tokenizer.word_index )

The output of the above snippet will be,

{'hello': 1, 'python': 2, 'java': 3, 'world': 4}
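
This mirrors your situation: word_index stores every word seen during fit_on_texts, irrespective of num_words, so its length (13233 in your case) can exceed num_words. Continuing the snippet above:

# word_index still contains all 4 words, even though num_words=3
print( len( tokenizer.word_index ) )   # 4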

According to the docs, only the top num_words-1 words (based on their frequency) are used while transforming words to indices. In our case num_words=3, so we'd expect the Tokenizer to use only 2 words for the transformation. The two most common words in texts are hello and python. Consider this example to inspect the output of texts_to_sequences:

input_seq = [
    'hello' , 
    'hello java' , 
    'hello python' , 
    'hello python java'
]
print( tokenizer.texts_to_sequences( input_seq ) )

The output,

[[1], [1], [1, 2], [1, 2]]

Observe that in the first sentence, hello is encoded as expected. In the second sentence, the word java isn't encoded because it is not part of the effective vocabulary. In the third sentence, both hello and python are included, which is the expected behavior. In the fourth sentence, the word java is again dropped from the output.
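
As a side note, if you don't want out-of-vocabulary words such as java to be silently dropped, the Tokenizer accepts an oov_token argument; words outside the kept range are then mapped to that token's index instead of being omitted. A minimal sketch (the OOV token takes index 1, shifting the other indices, so num_words=4 is needed here to keep the same two real words):

# The OOV token is assigned index 1, so hello becomes 2 and python becomes 3.
# With num_words=4, only indices < 4 are kept; everything else maps to the OOV index.
oov_tokenizer = tf.keras.preprocessing.text.Tokenizer( num_words=4 , oov_token='<OOV>' )
oov_tokenizer.fit_on_texts( texts )
print( oov_tokenizer.word_index )
# {'<OOV>': 1, 'hello': 2, 'python': 3, 'java': 4, 'world': 5}
print( oov_tokenizer.texts_to_sequences( [ 'hello python java' ] ) )
# [[2, 3, 1]] -> java falls outside the kept range and becomes <OOV>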

So my question arises: which is vocabulary size? num_words or the length of word_index?

As you might have understood, num_words is the vocabulary size, since only that many words are encoded in the output. The rest of the words, in our case java and world, are simply omitted from the transformation.
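
Putting it together: when sizing things downstream (for example the input_dim of an Embedding layer, or checking the largest index that pad_sequences will see), use num_words rather than len(word_index). A rough sketch, assuming the tokenizer and X from your question:

from keras.preprocessing.sequence import pad_sequences

# Output sequences only ever contain indices 1 .. num_words-1, so num_words is a
# safe upper bound for an Embedding layer's input_dim (index 0 is reserved for padding).
sequences = tokenizer.texts_to_sequences(X)      # X is the text column from the question
padded = pad_sequences(sequences, maxlen=100)    # maxlen=100 is an arbitrary choice here
vocab_size = tokenizer.num_words                 # 10000, not len(tokenizer.word_index)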
