I have a huge JSON file (at least thousands of keys) like:
{"A": ["5-65", "5-66", "5-67", "6-12", "6-59"],
 "B": ["5-65", "5-66", "5-67", "6-12", "6-59", "7-13"],
 "C": ["4-43", "5-65", "5-66", "5-67", "6-59", "7-12", "7-13"]
}
Each list contains no duplicate values.
I want to calculate the pairwise similarity between the lists in this JSON file.
For example, the result should look like this: {"A-B": similarity1, "A-C": similarity2, "B-C": similarity3}
The Jaccard distance might be used as the similarity measure.
CodePudding user response:
It is not clear how you define similarity, so let's first assume it is the number of common elements.
You can use set operations combined with itertools.combinations:
d = {"A": ["5-65", "5-66", "5-67", "6-12", "6-59"],
"B": ["5-65", "5-66", "5-67", "6-12", "6-59","7-13"],
"C": ["4-43","5-65", "5-66", "5-67", "6-59","7-12","7-13"]
}
## or to load fron json file
# import json
# with open('file.json') as f:
# d = json.load(f)
from itertools import combinations
out = {(a,b): len(set(d[a]).intersection(d[b])) for a,b in combinations(d, 2)}
output:
{('A', 'B'): 5, ('A', 'C'): 4, ('B', 'C'): 5}
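If you want the string keys shown in the question ("A-B" rather than the tuple ('A', 'B')), you can join the pair names with an f-string; a minimal variant of the same comprehension:

out = {f'{a}-{b}': len(set(d[a]) & set(d[b])) for a, b in combinations(d, 2)}
# {'A-B': 5, 'A-C': 4, 'B-C': 5}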
For efficiency, it is better to compute the sets only once:
d2 = {k: set(v) for k, v in d.items()}
out = {(a, b): len(d2[a] & d2[b]) for a, b in combinations(d2, 2)}
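To see why this matters at the scale mentioned in the question (thousands of keys): the number of pairs grows quadratically, so the one-liner above calls set() once per pair, while d2 builds each set exactly once. A rough count, assuming a hypothetical 2,000 keys:

from math import comb

n = 2_000          # hypothetical number of keys
print(comb(n, 2))  # 1999000 pairs, i.e. ~2 million set() calls in the one-liner
print(n)           # versus only 2000 set() calls when precomputing d2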
Jaccard similarity
This is defined as the size of the intersection divided by the size of the union (see the Jaccard index):
d2 = {k: set(v) for k, v in d.items()}
out = {(a, b): len(d2[a] & d2[b]) / len(d2[a] | d2[b]) for a, b in combinations(d2, 2)}
output:
{('A', 'B'): 0.8333333333333334, ('A', 'C'): 0.5, ('B', 'C'): 0.625}
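The question also mentions the Jaccard distance, which is simply 1 minus the Jaccard similarity computed above. And since JSON object keys must be strings, the "A-B" key style makes the result directly serializable; a sketch combining both (the output file name is made up):

import json
from itertools import combinations

d2 = {k: set(v) for k, v in d.items()}
dist = {f'{a}-{b}': 1 - len(d2[a] & d2[b]) / len(d2[a] | d2[b])
        for a, b in combinations(d2, 2)}
# {'A-B': 0.16666666666666663, 'A-C': 0.5, 'B-C': 0.375}

with open('similarities.json', 'w') as f:  # hypothetical output path
    json.dump(dist, f, indent=2)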