Best way to compare elements pairwise in a Pandas DataFrame (generating a "cross product")

Good morning all, I have a DataFrame in which one of the columns contains lists:

import pandas as pd

df = pd.DataFrame({'Ind':['A','B','C','D'],'lists':[['dog','cat','horse','squirrel','bird'],
                       ['dog','horse','fish','whale'],
                       ['moose','cat','squirrel','ant','chicken'],
                       ['dog','moose','cat','bird','ant']]})

What I would like to achieve is a cross-product matrix holding a "similarity" index for every pair of lists (strictly speaking a distance: it is 0 when the lists are identical and 1 when they have no element in common). What I'm using right now is a simple nested loop, which gets the job done but of course struggles as the DataFrame grows:

list_tot = []
for i in range(len(df)):
    list_temp = []
    for j in range(len(df)):
        list1 = df.iloc[i]['lists']
        list2 = df.iloc[j]['lists']
        minlist = min(len(list1),len(list2))
        dis = (minlist - len([el for el in list1 if el in list2]))/minlist
        list_temp.append(dis)
    list_tot.append(list_temp)

The output is currently a list of lists, but it could be any other structure.

[[0.0, 0.5, 0.6, 0.4],
[0.5, 0.0, 1.0, 0.75],
[0.6, 1.0, 0.0, 0.4],
[0.4, 0.75, 0.4, 0.0]]

I also know that the output matrix is symmetric, so I could compute just (N * (N + 1)) / 2 values instead of N ** 2, but I'm not sure how to end up with the same output.

Thank you very much in advance.

CodePudding user response：

You can use itertools.combinations with a custom function to compute the distance for each pair (here 1 - Jaccard similarity, but any function that takes two lists and returns a float will do), then use NumPy to fill the upper triangle of a square array and mirror it into the lower triangle:

import numpy as np
from itertools import combinations

def similarity(l1, l2):
    # despite the name, this returns a distance: 1 - Jaccard similarity
    s1 = set(l1)
    s2 = set(l2)
    return 1 - len(s1&s2)/len(s1|s2)

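n = len(df)
a = np.zeros((n, n))
# fill the strict upper triangle; combinations() yields the pairs in the same row-major order
a[np.triu_indices(n, k=1)] = [similarity(x, y) for x, y in combinations(df['lists'], r=2)]
a += a.T  # mirror the upper triangle into the lower one
# the diagonal stays 0: each list is at distance 0 from itself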

out = pd.DataFrame(a, index=df['Ind'], columns=df['Ind'])

print(out)

output:

Ind         A         B         C         D
Ind                                        
A    0.000000  0.714286  0.750000  0.571429
B    0.714286  0.000000  1.000000  0.875000
C    0.750000  1.000000  0.000000  0.571429
D    0.571429  0.875000  0.571429  0.000000
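
If you would rather keep the metric from the question than switch to Jaccard, the same upper-triangle approach should work with a different pair function. Below is a minimal sketch under a few assumptions: overlap_distance is a hypothetical helper name (not from the original code), each list is assumed to contain no duplicates (so sets give the same counts as the list comprehension in the question), and no list is empty (otherwise the division fails).

```python
from itertools import combinations

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Ind': ['A', 'B', 'C', 'D'],
    'lists': [['dog', 'cat', 'horse', 'squirrel', 'bird'],
              ['dog', 'horse', 'fish', 'whale'],
              ['moose', 'cat', 'squirrel', 'ant', 'chicken'],
              ['dog', 'moose', 'cat', 'bird', 'ant']],
})

def overlap_distance(l1, l2):
    # Hypothetical helper: the question's metric, (min_len - n_common) / min_len.
    # Assumes no duplicates inside a list and no empty lists.
    s1, s2 = set(l1), set(l2)
    min_len = min(len(s1), len(s2))
    return (min_len - len(s1 & s2)) / min_len

n = len(df)
dist = np.zeros((n, n))
# only the N * (N - 1) / 2 off-diagonal pairs are computed
dist[np.triu_indices(n, k=1)] = [overlap_distance(x, y)
                                 for x, y in combinations(df['lists'], r=2)]
dist += dist.T  # mirror into the lower triangle; the diagonal stays 0

out_overlap = pd.DataFrame(dist, index=df['Ind'], columns=df['Ind'])
print(out_overlap)
```

On the sample data this should give the same values as the nested loop in the question (0.5, 0.6, 0.4 in the first row, and so on), while calling the pair function only once per unordered pair.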