Find Jaccard similarity between multiple lists

Time:03-07

Let's say I have three lists:

l1 = ["a", "b", "c"]
l2 = ["c", "e", "f"]
l3 = ["c", "b", "a"]

For Jaccard similarity, I'm using the following function:

def jaccard_similarity(list1, list2):
    intersection = len(list(set(list1).intersection(list2)))
    union = (len(set(list1)) + len(set(list2))) - intersection
    return float(intersection) / union

How can I calculate the Jaccard similarity for all combinations, that is:

(l1,l1), (l1,l2), (l1, l3)
(l2,l1), (l2,l2), (l2, l3) 
(l3,l1), (l3,l2), (l3, l3)

I want to avoid doing this manually for each pair of lists. Also, the final output needs to be a 3x3 matrix.

CodePudding user response:

You can drop the list() call around set(...) in your original function. There is also no need to cast intersection to float, because the / operator performs true (floating-point) division in Python 3:

def jaccard_similarity(list1, list2):
    intersection = len(set(list1).intersection(list2))
    union = (len(set(list1)) + len(set(list2))) - intersection
    return intersection / union
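
One case the function above does not cover is two empty lists, where the union is empty and the division raises ZeroDivisionError. Here is a minimal sketch of a guarded variant. The name jaccard_similarity_safe is hypothetical, and returning 1.0 for two empty lists is an assumption, not something the question specifies:

```python
def jaccard_similarity_safe(list1, list2):
    # Hypothetical variant of jaccard_similarity; the name is illustrative only.
    set1, set2 = set(list1), set(list2)
    union = set1 | set2
    if not union:
        # Assumption: two empty lists are treated as identical.
        return 1.0
    return len(set1 & set2) / len(union)


print(jaccard_similarity_safe(["a", "b", "c"], ["c", "e", "f"]))  # 0.2
print(jaccard_similarity_safe([], []))  # 1.0
```

If your lists are never empty, the shorter function above is enough.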

You can use product from the itertools module to generate every ordered pair of lists, and starmap to call your function on each pair:

from itertools import product, starmap

l1 = ['a', 'b', 'c']
l2 = ['c', 'e', 'f']
l3 = ['c', 'b', 'a']

inputs = product([l1, l2, l3], [l1, l2, l3])

result = list(starmap(jaccard_similarity, inputs))
print(result)

Output:

[1.0, 0.2, 1.0, 0.2, 1.0, 0.2, 1.0, 0.2, 1.0]

Next, to create a matrix you can take a look at the grouper recipe from the documentation of itertools: https://docs.python.org/3/library/itertools.html#itertools-recipes

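Here's a simplified example of the grouper function. It works because the list holds the same iterator three times, so zip pulls three consecutive items for each row: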

def group_three(it):
    iterators = [iter(it)] * 3
    return zip(*iterators)

print(list(group_three(result)))

Output:

[(1.0, 0.2, 1.0), (0.2, 1.0, 0.2), (1.0, 0.2, 1.0)]
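
If you would rather skip the flatten-then-regroup step, a nested list comprehension might build the matrix directly. This is only a sketch of an alternative to the itertools approach; the names lists and matrix are introduced here for illustration:

```python
def jaccard_similarity(list1, list2):
    intersection = len(set(list1).intersection(list2))
    union = (len(set(list1)) + len(set(list2))) - intersection
    return intersection / union


# Illustrative container for the three lists from the question.
lists = [["a", "b", "c"], ["c", "e", "f"], ["c", "b", "a"]]

# Row i, column j holds the similarity between lists[i] and lists[j].
matrix = [[jaccard_similarity(a, b) for b in lists] for a in lists]

for row in matrix:
    print(row)
```

This gives a list of lists with one row per input list, so it also scales to more than three lists without changing the grouping size.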