How to calculate pairwise similarity (Jaccard distance) between sets

Time:05-03

I have a huge JSON file (thousands of keys or more) like:

{"A": ["5-65", "5-66", "5-67", "6-12", "6-59"],
 "B": ["5-65", "5-66", "5-67", "6-12", "6-59","7-13"],
 "C": ["4-43","5-65", "5-66", "5-67", "6-59","7-12","7-13"]
}

Each list has no duplicated values.

I want to calculate the pairwise similarity of the entries in this JSON file. For example, the result should look like this: {"A-B": similarity1, "A-C": similarity2, "B-C": similarity3}. The Jaccard distance might be used as the similarity measure.

CodePudding user response:

It is not clear how you define similarity, so let's first assume it is the number of common elements.

You can use set methods combined with itertools.combinations:

d = {"A": ["5-65", "5-66", "5-67", "6-12", "6-59"],
     "B": ["5-65", "5-66", "5-67", "6-12", "6-59","7-13"],
     "C": ["4-43","5-65", "5-66", "5-67", "6-59","7-12","7-13"]
     }

## or, to load from a JSON file:
# import json
# with open('file.json') as f:
#    d = json.load(f)

from itertools import combinations

out = {(a,b): len(set(d[a]).intersection(d[b])) for a,b in combinations(d, 2)}

output:

{('A', 'B'): 5, ('A', 'C'): 4, ('B', 'C'): 5}

For efficiency, it is better to compute the sets only once:

d2 = {k: set(v) for k,v in d.items()}
out = {(a,b): len(d2[a]&d2[b]) for a,b in combinations(d, 2)}

Jaccard similarity

This is defined as the size of the intersection divided by the size of the union (see Jaccard index); the Jaccard distance is 1 minus this value:

d2 = {k: set(v) for k,v in d.items()}
{(a,b): len(d2[a]&d2[b])/len(d2[a]|d2[b]) for a,b in combinations(d, 2)}

output:

{('A', 'B'): 0.8333333333333334, ('A', 'C'): 0.5, ('B', 'C'): 0.625}
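The question asks for string keys such as "A-B", and tuple keys like ('A', 'B') cannot be written with json.dump. A minimal sketch of one way to get that shape, and optionally the Jaccard distance instead of the similarity, is below. The function names jaccard_similarity and pairwise_jaccard are made up for illustration, and treating two empty sets as identical (similarity 1.0) is an assumption, not something the question specifies.

```python
from itertools import combinations


def jaccard_similarity(x, y):
    """Return |x & y| / |x | y| for two sets.

    Assumption: two empty sets count as identical (1.0),
    which also avoids a division by zero.
    """
    union = x | y
    if not union:
        return 1.0
    return len(x & y) / len(union)


def pairwise_jaccard(d, distance=False):
    """Map "A-B" style keys to the Jaccard similarity (or distance) of each pair."""
    # Convert every list to a set once, not once per pair.
    sets = {k: set(v) for k, v in d.items()}
    out = {}
    for a, b in combinations(sets, 2):
        sim = jaccard_similarity(sets[a], sets[b])
        out[f"{a}-{b}"] = 1 - sim if distance else sim
    return out


d = {"A": ["5-65", "5-66", "5-67", "6-12", "6-59"],
     "B": ["5-65", "5-66", "5-67", "6-12", "6-59", "7-13"],
     "C": ["4-43", "5-65", "5-66", "5-67", "6-59", "7-12", "7-13"]}

similarities = pairwise_jaccard(d)               # keys: "A-B", "A-C", "B-C"
distances = pairwise_jaccard(d, distance=True)   # 1 - similarity for each pair
```

Both results can be passed straight to json.dump. If the real keys can themselves contain "-", a different separator would be needed to keep the joined keys unambiguous. Note also that the number of pairs grows quadratically with the number of keys, so a few thousand keys already means millions of pairs.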