Frequency of words in a text not present in another text with tf.Tokenizer

I have a text A and a text B. I wish to find the percentage of words in text B (counting all occurrences) not present in the vocabulary (i.e., the list of all unique words) of text A.

E.g.,

A = "a cat and a cat and a mouse"

B = "a plump cat and some more cat and some more mice too"

B has 12 words. Plump, some, more, mice and too are not in A: plump occurs once, some twice, more twice, mice once and too once. So 7 of the 12 words in B are not in A, i.e. about 58 % of B is not in A.

I think we can use TensorFlow's Tokenizer, but plain Python, another tokenizer or any other solution is welcome.

With tf.Tokenizer I get the word_index for the text A

from tensorflow.keras.preprocessing.text import Tokenizer

a_tokenizer = Tokenizer()
a_tokenizer.fit_on_texts(textA)  # Builds the word index
word_index = a_tokenizer.word_index

and the word_counts for text B:

b_tokenizer = Tokenizer()
b_tokenizer.fit_on_texts(generated_text)  # generated_text holds text B
word_count = b_tokenizer.word_counts

A bad way to achieve this is to go through the words in B's word_counts and look each one up in A's word_index:

num_words_b = 0
num_words_b_in_a = 0

for word_b, count_b in b_tokenizer.word_counts.items():
    num_words_b += count_b  # total number of word occurrences in B

    for word_a, index_a in a_tokenizer.word_index.items():
        if word_b == word_a:
            num_words_b_in_a += count_b  # occurrences in B of words known to A
            break

and then compute 1 - num_words_b_in_a / num_words_b. Is there a more elegant look-up?

EDIT: In fact, the above does not work at all, because it tokenizes into characters. I want to tokenize into words.
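
A likely cause, assuming textA is a plain string: fit_on_texts expects a list of texts and loops over whatever it is given, so a bare string is visited one character at a time. The sketch below only mimics that loop in plain Python; iterate_texts is a hypothetical stand-in and not the Keras implementation.

```python
text_a = "a cat and a cat and a mouse"


def iterate_texts(texts):
    """Hypothetical stand-in: loop over `texts`, split each item on whitespace."""
    tokens = []
    for text in texts:
        tokens.extend(text.split())
    return tokens


# A bare string is iterated character by character.
as_string = iterate_texts(text_a)
# A one-element list is iterated document by document.
as_list = iterate_texts([text_a])

print(as_string[:4])  # ['a', 'c', 'a', 't']
print(as_list)        # ['a', 'cat', 'and', 'a', 'cat', 'and', 'a', 'mouse']
```

If that is what is happening, wrapping the string in a list, as in fit_on_texts([textA]), should give word-level tokens.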

CodePudding user response：

Maybe try something like this:

import tensorflow as tf

docs1 = ['Well done!',
        'Good work',
        'Great effort',
        'nice work',
        'Excellent!']


docs2 = ['Well!',
        'work',
        'work',
        'effort',
        'rabbit',
        'nice']

a = tf.keras.preprocessing.text.Tokenizer()
a.fit_on_texts(docs1)

b = tf.keras.preprocessing.text.Tokenizer()
b.fit_on_texts(docs2)

word_index = a.word_index
word_counts = dict(b.word_counts)
# Words of B that also appear in A's vocabulary
common = set(word_counts).intersection(word_index)
num_words_b = sum(word_counts.values())
num_words_b_in_a = sum(word_counts[word] for word in common)
# Fraction of B's word occurrences that are not in A
percentage = 1 - num_words_b_in_a / num_words_b
print(percentage)

0.16666666666666663
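
Since the question also allows plain Python, here is a minimal sketch of the same idea without TensorFlow, run on the A and B strings from the question. The function name fraction_not_in is made up for illustration. The tokenization is an assumption: it lowercases and splits on whitespace only, so unlike the Keras Tokenizer defaults it does not strip punctuation.

```python
from collections import Counter

text_a = "a cat and a cat and a mouse"
text_b = "a plump cat and some more cat and some more mice too"


def fraction_not_in(text_a, text_b):
    """Fraction of word occurrences in text_b whose word is absent from text_a."""
    # Assumption: lowercase + whitespace split is good enough as a tokenizer.
    vocab_a = set(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())

    total_b = sum(counts_b.values())
    if total_b == 0:
        return 0.0  # avoid dividing by zero for an empty text B

    # Set membership replaces the inner loop over A's vocabulary.
    missing_b = sum(n for word, n in counts_b.items() if word not in vocab_a)
    return missing_b / total_b


print(round(fraction_not_in(text_a, text_b), 2))  # 0.58
```

This reproduces the 7 out of 12 from the question. The set look-up is what makes it cheaper than comparing every word of B against every word of A.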