I'm using GeneralizedJaccard from the py_stringmatching package to measure the similarity between two strings.
According to this document:
... If the similarity of a token pair exceeds the threshold, then the token pair is considered a match ...
For example, for the word pair 'method' and 'methods' we have:
print(sm.Levenshtein().get_sim_score('method','methods'))
>>0.8571428571428572
The similarity for this pair is about 0.857, which is greater than the 0.8 threshold, so the pair should be considered a match. I therefore expected the final GeneralizedJaccard output for two near-duplicate sentences to be 1, but it's about 0.976:
import py_stringmatching as sm
str1='All tokenizers have a tokenize method'
str2='All tokenizers have a tokenize methods'
alphabet_tok_set = sm.AlphabeticTokenizer(return_set=True)
gj = sm.GeneralizedJaccard(sim_func=sm.Levenshtein().get_sim_score, threshold=0.8)
print(gj.get_raw_score(alphabet_tok_set.tokenize(str1),alphabet_tok_set.tokenize(str2)))
>>0.9761904761904763
So what is the problem?
CodePudding user response:
The answer is that after a token pair is considered a match, the pair's similarity score (not 1) is what enters the Jaccard formula. GeneralizedJaccard sums the similarity scores of all matched pairs and divides by |A| + |B| - (number of matched pairs). Here five pairs match exactly (score 1.0) and ('method', 'methods') matches with score 6/7 ≈ 0.857, giving (5 + 6/7) / (6 + 6 - 6) = 41/42 ≈ 0.976, which is exactly the output you saw.
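You can reproduce the 0.976 result by hand. The sketch below uses a small pure-Python normalized Levenshtein similarity (standing in for sm.Levenshtein().get_sim_score, so it runs without py_stringmatching installed) and applies the formula sum(matched-pair scores) / (|A| + |B| - number of matches):

```python
def lev_sim(a, b):
    # Normalized Levenshtein similarity: 1 - edit_distance / max(len a, len b).
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,              # deletion
                         cur[j - 1] + 1,           # insertion
                         prev[j - 1] + (a[i - 1] != b[j - 1]))  # substitution
        prev = cur
    return 1 - prev[n] / max(m, n)

# Token sets produced by the alphabetic tokenizer for the two sentences.
A = ['all', 'tokenizers', 'have', 'a', 'tokenize', 'method']
B = ['all', 'tokenizers', 'have', 'a', 'tokenize', 'methods']

# Five pairs match exactly (score 1.0); ('method', 'methods') matches
# with score 6/7 ≈ 0.857.  The raw scores of the matched pairs, not 1,
# go into the numerator.
matched_scores = [1.0] * 5 + [lev_sim('method', 'methods')]

score = sum(matched_scores) / (len(A) + len(B) - len(matched_scores))
print(score)  # ≈ 0.97619, the same value get_raw_score returns
```

This is why the score is 41/42 rather than 1: the near-match contributes only 6/7 to the numerator, while the denominator still counts it as a single shared token.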