I have a text A and a text B. I wish to find the percentage of words in text B (counting all occurrences) not present in the vocabulary (i.e., the list of all unique words) of text A.
E.g.,
A = "a cat and a cat and a mouse"
B = "a plump cat and some more cat and some more mice too"
B has 12 words. Plump, some, more, mice and too are not in A. Plump occurs once, some twice, more twice, mice once and too once, so 7 out of 12 words in B are not in A, i.e. about 58 % of B is not in A.
I think we can use the Keras Tokenizer that ships with TensorFlow. Other solutions are welcome too: plain Python, another tokenizer, or anything else.
With the Keras Tokenizer I get the word_index for text A
from tensorflow.keras.preprocessing.text import Tokenizer
a_tokenizer = Tokenizer()
a_tokenizer.fit_on_texts(textA)  # builds the word index
word_index = a_tokenizer.word_index
and the word_counts for text B
b_tokenizer = Tokenizer()
b_tokenizer.fit_on_texts(generated_text)
word_counts = b_tokenizer.word_counts  # word -> number of occurrences
A bad way to achieve this is to go through the words in B's word_counts and look each one up in A's word_index:
num_words_b = 0
num_words_b_in_a = 0
for word_b, count_b in b_tokenizer.word_counts.items():
    num_words_b += count_b  # running total of all words in B
    for word_a, index_a in a_tokenizer.word_index.items():
        if word_b == word_a:
            num_words_b_in_a += count_b  # occurrences in B of a word that A also has
            break
and then compute 1 - num_words_b_in_a / num_words_b. Is there a more elegant look-up?
EDIT: In fact, the above does not work at all, because it tokenizes into characters. I want to tokenize into words.
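A likely cause, assuming textA and generated_text are plain Python strings (the question does not say so explicitly): fit_on_texts loops over its argument, and looping over a single string yields one character at a time, while looping over a list that contains the string yields the whole text. The plain-Python sketch below only imitates that loop; items_seen is a hypothetical helper, not part of Keras.

```python
def items_seen(texts):
    """Return the items a `for text in texts` loop would visit (hypothetical helper)."""
    return [text for text in texts]


text_a = "a cat and a cat and a mouse"

# A bare string is iterated character by character.
print(items_seen(text_a)[:5])   # ['a', ' ', 'c', 'a', 't']

# A one-element list is iterated text by text, so the whole string survives.
print(items_seen([text_a]))     # ['a cat and a cat and a mouse']
```

If that is indeed the cause, passing [textA] and [generated_text] to fit_on_texts should produce word-level tokens.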
CodePudding user response:
Maybe try something like this:
import tensorflow as tf
docs1 = ['Well done!',
         'Good work',
         'Great effort',
         'nice work',
         'Excellent!']
docs2 = ['Well!',
         'work',
         'work',
         'effort',
         'rabbit',
         'nice']
# fit_on_texts takes a list of strings and splits each one into words
a = tf.keras.preprocessing.text.Tokenizer()
a.fit_on_texts(docs1)
b = tf.keras.preprocessing.text.Tokenizer()
b.fit_on_texts(docs2)
word_index = a.word_index  # vocabulary of docs1: word -> index
word_counts = dict(b.word_counts)  # word -> number of occurrences in docs2
# words of docs2 that also appear in the vocabulary of docs1
common = set(word_counts).intersection(word_index)
num_words_b = sum(word_counts.values())
num_words_b_in_a = sum(word_counts[word] for word in common)
percentage = 1 - num_words_b_in_a / num_words_b  # fraction of docs2 not in docs1
print(percentage)
0.16666666666666663
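Since plain Python was also welcome, here is a minimal sketch of the same calculation without TensorFlow, run on the A and B strings from the question. It assumes that lowercasing and splitting on whitespace is a good enough tokenizer (punctuation is not handled), and the function name fraction_not_in_vocab is made up for this example.

```python
from collections import Counter


def fraction_not_in_vocab(text_a: str, text_b: str) -> float:
    """Fraction of word occurrences in text_b whose word is absent from text_a."""
    vocab_a = set(text_a.lower().split())        # unique words of A
    counts_b = Counter(text_b.lower().split())   # word -> occurrences in B
    total_b = sum(counts_b.values())
    if total_b == 0:
        return 0.0  # assumption: an empty B counts as 0 % missing
    # Set membership is a constant-time look-up, so no inner loop over A is needed.
    missing = sum(n for word, n in counts_b.items() if word not in vocab_a)
    return missing / total_b


A = "a cat and a cat and a mouse"
B = "a plump cat and some more cat and some more mice too"
print(f"{fraction_not_in_vocab(A, B):.0%}")  # 58%
```

Using a set for A's vocabulary replaces the nested loop from the question with a single pass over B's counts.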