Good morning all, I have a dataframe in which one of the columns contains lists:
import pandas as pd
df = pd.DataFrame({'Ind': ['A', 'B', 'C', 'D'],
                   'lists': [['dog', 'cat', 'horse', 'squirrel', 'bird'],
                             ['dog', 'horse', 'fish', 'whale'],
                             ['moose', 'cat', 'squirrel', 'ant', 'chicken'],
                             ['dog', 'moose', 'cat', 'bird', 'ant']]})
What I would like to achieve is a cross-product matrix holding an index of the "similarity" between every pair of lists (it is 0 when they are identical and 1 if they don't have any element in common). What I'm using right now is a simple nested loop, which gets the job done but of course struggles as the dataframe grows:
list_tot = []
for i in range(len(df)):
    list_temp = []
    for j in range(len(df)):
        list1 = df.iloc[i]['lists']
        list2 = df.iloc[j]['lists']
        minlist = min(len(list1), len(list2))
        dis = (minlist - len([el for el in list1 if el in list2])) / minlist
        list_temp.append(dis)
    list_tot.append(list_temp)
The output is currently a list of lists, but it could be in any format.
[[0.0, 0.5, 0.6, 0.4],
[0.5, 0.0, 1.0, 0.75],
[0.6, 1.0, 0.0, 0.4],
[0.4, 0.75, 0.4, 0.0]]
I also know that the output matrix is symmetric, so I could just calculate (N * (N - 1)) / 2 similarities instead of N ** 2, but I'm not sure how to end up with the same output.
Thank you very much in advance.
CodePudding user response:
You can use itertools.combinations with a custom function to compute the similarity (here using 1 - Jaccard similarity; you can use any function that takes 2 lists as input and returns a float), then a bit of numpy magic:
import numpy as np
from itertools import combinations
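def similarity(l1, l2):
    # 1 - Jaccard similarity: 0 for identical sets, 1 for disjoint sets
    s1 = set(l1)
    s2 = set(l2)
    return 1 - len(s1 & s2) / len(s1 | s2)
a = np.zeros((len(df), len(df)))
# fill the upper triangle (k=1 skips the diagonal), one value per pair
a[np.triu_indices(len(df), k=1)] = [similarity(x, y) for x, y in combinations(df['lists'], r=2)]
# mirror the upper triangle into the lower one; the diagonal stays 0
a += a.T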
out = pd.DataFrame(a, index=df['Ind'], columns=df['Ind'])
print(out)
output:
Ind         A         B         C         D
Ind
A    0.000000  0.714286  0.750000  0.571429
B    0.714286  0.000000  1.000000  0.875000
C    0.750000  1.000000  0.000000  0.571429
D    0.571429  0.875000  0.571429  0.000000
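If you would rather keep the distance from the question and a plain list-of-lists output, the same "compute each pair once, then mirror" idea can be sketched without numpy. The names overlap_distance and pairwise_matrix are made up for this illustration, and the sketch assumes that every list is non-empty and that the function is symmetric (which holds for the question's formula as long as the lists contain no duplicates):

```python
from itertools import combinations

def overlap_distance(l1, l2):
    # Distance from the question: (min length - number of shared elements) / min length
    shortest = min(len(l1), len(l2))  # assumes both lists are non-empty
    s2 = set(l2)  # set lookup instead of scanning l2 for every element
    common = sum(1 for el in l1 if el in s2)
    return (shortest - common) / shortest

def pairwise_matrix(lists, func):
    n = len(lists)
    matrix = [[0.0] * n for _ in range(n)]  # the diagonal stays 0.0
    # combinations yields every unordered pair (i < j) exactly once,
    # so func is called N * (N - 1) / 2 times instead of N ** 2
    for (i, l1), (j, l2) in combinations(enumerate(lists), 2):
        matrix[i][j] = matrix[j][i] = func(l1, l2)
    return matrix

lists = [['dog', 'cat', 'horse', 'squirrel', 'bird'],
         ['dog', 'horse', 'fish', 'whale'],
         ['moose', 'cat', 'squirrel', 'ant', 'chicken'],
         ['dog', 'moose', 'cat', 'bird', 'ant']]

result = pairwise_matrix(lists, overlap_distance)
for row in result:
    print(row)
```

With the dataframe from the question you would call it as pairwise_matrix(df['lists'].tolist(), overlap_distance), and it should give the same list of lists as the nested loop.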