Problem with Py_stringmatching GeneralizedJaccard


I'm using GeneralizedJaccard from the py_stringmatching package to measure the similarity between two strings. According to the documentation:

... If the similarity of a token pair exceeds the threshold, then the token pair is considered a match ...

For example, for the word pair 'method' and 'methods' we have:

import py_stringmatching as sm

print(sm.Levenshtein().get_sim_score('method', 'methods'))
>>0.8571428571428572
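
(If I understand get_sim_score correctly, this is the normalized Levenshtein similarity, 1 minus the edit distance divided by the longer string's length, i.e. one edit over seven characters:)

print(1 - 1/7)  # ≈ 0.857142857142857, the same value as above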

The similarity of this word pair is about 0.857, which is greater than the 0.8 threshold, so the pair should be considered a match. I therefore expected the final GeneralizedJaccard score for two near-duplicate sentences to be 1, but it is about 0.97:

import py_stringmatching as sm

str1='All tokenizers have a tokenize method'
str2='All tokenizers have a tokenize methods'

alphabet_tok_set = sm.AlphabeticTokenizer(return_set=True)

gj = sm.GeneralizedJaccard(sim_func=sm.Levenshtein().get_sim_score, threshold=0.8)
print(gj.get_raw_score(alphabet_tok_set.tokenize(str1),alphabet_tok_set.tokenize(str2)))

>>0.9761904761904763

So what is the problem?!

CodePudding user response:

The answer is that after a token pair is considered a match, the similarity score of that pair is used in the Jaccard formula instead of 1. Matched pairs therefore contribute their actual similarity to the numerator, which is why two near-duplicate sentences score slightly below 1.
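
You can see this by recomputing the score by hand. The sketch below assumes the generalized Jaccard formula used by py_stringmatching: the sum of the matched pairs' similarity scores divided by |X| + |Y| minus the number of matched pairs.

import py_stringmatching as sm

# Five tokens match exactly (similarity 1.0 each); the sixth matched pair
# is ('method', 'methods') with Levenshtein similarity ~0.857.
pair_sim = sm.Levenshtein().get_sim_score('method', 'methods')
matched_sims = [1.0] * 5 + [pair_sim]

# Numerator: sum of matched-pair similarities (not simply the match count).
# Denominator: |X| + |Y| - number of matched pairs = 6 + 6 - 6.
print(sum(matched_sims) / (6 + 6 - len(matched_sims)))
# ≈ 0.97619..., the same value returned by get_raw_score above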
