How to calculate pairwise similarity (Jaccard distance) between sets

I have a huge JSON file (at least thousands of keys) that looks like this:

{"A": ["5-65", "5-66", "5-67", "6-12", "6-59"],
 "B": ["5-65", "5-66", "5-67", "6-12", "6-59","7-13"],
 "C": ["4-43","5-65", "5-66", "5-67", "6-59","7-12","7-13"]
}

Each list contains no duplicate values.

I want to calculate the pairwise similarity between the lists in this JSON file. For example, the result should look like this: {"A-B": similarity1, "A-C": similarity2, "B-C": similarity3}. The Jaccard index (the basis of the Jaccard distance) might be used as the similarity measure.

CodePudding user response：

It is not clear how you define similarity, so let's first assume it is the number of common elements.

You can use set methods combined with itertools.combinations:

d = {"A": ["5-65", "5-66", "5-67", "6-12", "6-59"],
     "B": ["5-65", "5-66", "5-67", "6-12", "6-59","7-13"],
     "C": ["4-43","5-65", "5-66", "5-67", "6-59","7-12","7-13"]
     }

## or, to load from a JSON file:
# import json
# with open('file.json') as f:
#    d = json.load(f)

from itertools import combinations

out = {(a,b): len(set(d[a]).intersection(d[b])) for a,b in combinations(d, 2)}

output:

{('A', 'B'): 5, ('A', 'C'): 4, ('B', 'C'): 5}

For efficiency, it is better to compute the sets only once:

d2 = {k: set(v) for k,v in d.items()}
out = {(a,b): len(d2[a]&d2[b]) for a,b in combinations(d, 2)}
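Since the file is described as huge, it may help to skip pairs that share nothing. The following is only a sketch of one possible alternative, not part of the original answer: it builds an inverted index (element -> keys) and counts co-occurrences. The function name shared_counts is hypothetical, and whether this beats the plain loop over combinations depends on how many keys share each element. Pairs with no common element simply do not appear in the result, and each pair is ordered alphabetically rather than by position in d.

```python
from collections import Counter, defaultdict
from itertools import combinations

def shared_counts(d):
    """Count common elements for every pair of keys sharing at least one element."""
    # Inverted index: element -> set of keys whose list contains it
    index = defaultdict(set)
    for key, values in d.items():
        for value in set(values):
            index[value].add(key)

    counts = Counter()
    for keys in index.values():
        # Each element shared by a and b adds 1 to the size of their intersection
        for pair in combinations(sorted(keys), 2):
            counts[pair] += 1
    return counts

d = {"A": ["5-65", "5-66", "5-67", "6-12", "6-59"],
     "B": ["5-65", "5-66", "5-67", "6-12", "6-59", "7-13"],
     "C": ["4-43", "5-65", "5-66", "5-67", "6-59", "7-12", "7-13"]}

common = shared_counts(d)
print(dict(common))
```

On the sample data this gives the same counts as the dictionary comprehension above (5, 4 and 5 for the three pairs).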

Jaccard similarity

This is defined as the size of the intersection divided by the size of the union (see Jaccard index):

d2 = {k: set(v) for k,v in d.items()}
{(a,b): len(d2[a]&d2[b])/len(d2[a]|d2[b]) for a,b in combinations(d, 2)}

output:

{('A', 'B'): 0.8333333333333334, ('A', 'C'): 0.5, ('B', 'C'): 0.625}
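To get the "A-B" style keys from the question (tuple keys cannot be written to JSON directly), the same computation can be wrapped as below. This is a minimal sketch under a few assumptions: the helper name jaccard is hypothetical, two empty sets are treated as identical (similarity 1.0) to avoid dividing by zero, and the keys themselves are assumed not to contain "-", otherwise the joined key becomes ambiguous. The Jaccard distance is taken as 1 minus the similarity.

```python
import json
from itertools import combinations

def jaccard(s, t):
    """Jaccard similarity: size of the intersection over size of the union."""
    union = s | t
    if not union:
        # Assumption: two empty sets count as identical
        return 1.0
    return len(s & t) / len(union)

d = {"A": ["5-65", "5-66", "5-67", "6-12", "6-59"],
     "B": ["5-65", "5-66", "5-67", "6-12", "6-59", "7-13"],
     "C": ["4-43", "5-65", "5-66", "5-67", "6-59", "7-12", "7-13"]}

# Convert each list to a set only once
sets = {k: set(v) for k, v in d.items()}

# String keys such as "A-B" keep the result JSON-serializable
similarity = {f"{a}-{b}": jaccard(sets[a], sets[b])
              for a, b in combinations(sets, 2)}

# Jaccard distance = 1 - Jaccard similarity
distance = {pair: 1 - sim for pair, sim in similarity.items()}

print(json.dumps(similarity))
```

If the result is too large to hold in memory, the pairs could instead be written out one at a time inside a plain for loop over combinations(sets, 2).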