I have a df like this:
import pandas as pd

d = {'user': [1, 1, 1, 2, 2, 2, 3, 3],
     'id': ['C1t', 'S4g', 'OT2', '3Ba', 'd6a', 'd9o', 'tot', 'p5t'],
     'label': ['dog', 'cat', 'bird', 'table', 'tab', 'mop', 'mom', 'dad']}
df1 = pd.DataFrame(d)
print(df1)
   user   id  label
0     1  C1t    dog
1     1  S4g    cat
2     1  OT2   bird
3     2  3Ba  table
4     2  d6a    tab
5     2  d9o    mop
6     3  tot    mom
7     3  p5t    dad
Then I create a dictionary with each user as a key, like this:
from collections import defaultdict

df_to_dict = defaultdict(list)
for index, row in df1.iterrows():
    df_to_dict[row["user"]].append(
        {"label": row["label"],
         "id": row["id"]})
print(df_to_dict)
defaultdict(list,
            {1: [{'label': 'dog', 'id': 'C1t'},
                 {'label': 'cat', 'id': 'S4g'},
                 {'label': 'bird', 'id': 'OT2'}],
             2: [{'label': 'table', 'id': '3Ba'},
                 {'label': 'tab', 'id': 'd6a'},
                 {'label': 'mop', 'id': 'd9o'}],
             3: [{'label': 'mom', 'id': 'tot'},
                 {'label': 'dad', 'id': 'p5t'}]})
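As a side note, the same per-user mapping can probably be built without `iterrows()`, which tends to be slow because it constructs a Series for every row. A minimal sketch, assuming the sample frame above; the name `records_by_user` is hypothetical and not part of the original code:

```python
from collections import defaultdict

import pandas as pd

# Same sample data as in the question
df1 = pd.DataFrame({
    "user": [1, 1, 1, 2, 2, 2, 3, 3],
    "id": ["C1t", "S4g", "OT2", "3Ba", "d6a", "d9o", "tot", "p5t"],
    "label": ["dog", "cat", "bird", "table", "tab", "mop", "mom", "dad"],
})

# records_by_user is a hypothetical name; it plays the role of df_to_dict
records_by_user = defaultdict(list)

# Iterating over the columns in parallel avoids building a Series per row
for user, id_, label in zip(df1["user"], df1["id"], df1["label"]):
    records_by_user[user].append({"label": label, "id": id_})
```

The resulting structure has the same shape as `df_to_dict`, so the rest of the code should not need to change.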
Now my aim is to compute the string similarity for every pair of 'label' values, but only within each user (for instance, for user 1: dog-cat, cat-bird and dog-bird), using the similarity index returned by this function:
def split(word):
    return [char for char in word]

# function for calculating the Jaccard similarity of the two character sets
# (despite its name, it returns a similarity, not a distance)
def DistJaccard(str1, str2):
    l1 = set(split(str1))
    l2 = set(split(str2))
    res = float(len(l1 & l2)) / len(l1 | l2)
    return res
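To make the score concrete, here is a small worked check of the same computation: the number of distinct characters shared by both strings divided by the number of distinct characters in either. This is only a sketch; `jaccard_similarity` is a hypothetical, more compact equivalent of `DistJaccard`, not a different metric:

```python
def jaccard_similarity(str1, str2):
    """Shared distinct characters divided by all distinct characters."""
    chars1, chars2 = set(str1), set(str2)
    return len(chars1 & chars2) / len(chars1 | chars2)


# 'table' -> {t, a, b, l, e}; 'tab' -> {t, a, b}: 3 shared out of 5 distinct
print(jaccard_similarity("table", "tab"))  # 0.6

# 'dog' and 'cat' have no characters in common
print(jaccard_similarity("dog", "cat"))    # 0.0
```

Note that the score ignores character order and repetition, and that two empty strings would raise a `ZeroDivisionError` because the union is empty.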
My desired result is a dataframe with the columns 'user', the 'id' of each of the two records that were compared, both labels, and the string similarity score. Something like this:
   user  id1  id2 label1 label2  similarity score
0     1  C1t  S4g    dog    cat              0.00
1     1  OT2  C1t   bird    dog              0.17
2     1  OT2  S4g   bird    cat              0.00
3     2  3Ba  d6a  table    tab              0.60
4     2  d9o  3Ba    mop  table              0.00
5     2  d9o  d6a    mop    tab              0.00
# and so on for user 3
So my problem is how to apply the string similarity function to each pair of labels per user and then export all the results to a dataframe. Any idea on how to approach this? I could do it with pandas operations alone, but I have to do it in a dictionary first, as it's much faster. Thank you!
Answer:
def create_similarity_df(df_to_dict):
    # NOTE: DataFrame.append was removed in pandas 2.0, so this version needs pandas < 2.0.
    # On newer versions, collect the row dicts in a list and call pd.DataFrame(rows) once.
    df_similarity = pd.DataFrame()
    for user in df_to_dict:
        records = df_to_dict[user]
        # compare every record with each later record of the same user
        for i in range(len(records)):
            for j in range(i + 1, len(records)):
                df_similarity = df_similarity.append(
                    {'user': user,
                     'id1': records[i]['id'],
                     'id2': records[j]['id'],
                     'label1': records[i]['label'],
                     'label2': records[j]['label'],
                     'similarity': DistJaccard(records[i]['label'], records[j]['label'])},
                    ignore_index=True)
    return df_similarity
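produces (this output was generated from a slightly different version of the sample data, hence the differing ids and labels):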
   user  id1  id2 label1 label2  similarity
0   1.0  C1t  Syg    dog    cat    0.000000
1   1.0  C1t  OT2    dog   bird    0.166667
2   1.0  Syg  OT2    cat   bird    0.000000
3   2.0  eBa  dFa  table   door    0.000000
4   2.0  eBa  dzo  table    mop    0.000000
5   2.0  dFa  dzo   door    mop    0.200000
6   3.0  tot  p5t    mom    dad    0.000000
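The same pairing can also be written with `itertools.combinations`, collecting the rows in a list and building the dataframe once. This avoids `DataFrame.append`, which no longer exists in pandas 2.0 and later, and it keeps the `user` column as integers. A sketch only: `build_similarity_df`, `dist_jaccard` and `sample` are hypothetical names, and the input is assumed to have the same shape as `df_to_dict`:

```python
from itertools import combinations

import pandas as pd


def dist_jaccard(str1, str2):
    """Same score as DistJaccard: shared distinct chars / all distinct chars."""
    chars1, chars2 = set(str1), set(str2)
    return len(chars1 & chars2) / len(chars1 | chars2)


def build_similarity_df(records_by_user):
    """Compare every pair of records within each user (hypothetical helper)."""
    columns = ["user", "id1", "id2", "label1", "label2", "similarity"]
    rows = []
    for user, records in records_by_user.items():
        # combinations(records, 2) yields each unordered pair exactly once
        for first, second in combinations(records, 2):
            rows.append({
                "user": user,
                "id1": first["id"],
                "id2": second["id"],
                "label1": first["label"],
                "label2": second["label"],
                "similarity": dist_jaccard(first["label"], second["label"]),
            })
    # Build the frame in one go instead of appending row by row
    return pd.DataFrame(rows, columns=columns)


# Small sample in the same shape as df_to_dict
sample = {
    1: [{"label": "dog", "id": "C1t"},
        {"label": "cat", "id": "S4g"},
        {"label": "bird", "id": "OT2"}],
    3: [{"label": "mom", "id": "tot"},
        {"label": "dad", "id": "p5t"}],
}
result = build_similarity_df(sample)
print(result)
```

Building the frame once from a list of dicts is usually much cheaper than appending inside the loop, since each append copies the whole frame.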