Compare pairs of values of a dictionary and then export the result to a dataframe

I have a df like this:

import pandas as pd

d = {'user': [1, 1, 1, 2, 2, 2, 3, 3],
     'id': ['C1t', 'S4g', 'OT2', '3Ba', 'd6a', 'd9o', 'tot', 'p5t'],
     'label': ['dog', 'cat', 'bird', 'table', 'tab', 'mop', 'mom', 'dad']}

df1 = pd.DataFrame(d)
print(df1)
   user   id  label
0     1  C1t    dog
1     1  S4g    cat
2     1  OT2   bird
3     2  3Ba  table
4     2  d6a    tab
5     2  d9o    mop
6     3  tot    mom
7     3  p5t    dad

Then I create a dictionary keyed by user, like this:

from collections import defaultdict
df_to_dict = defaultdict(list)
for index, row in df1.iterrows():
    df_to_dict[row["user"]].append(
        {"label": row["label"],
         "id": row["id"]})

print(df_to_dict)
defaultdict(list,
            {1: [{'label': 'dog', 'id': 'C1t'},
              {'label': 'cat', 'id': 'S4g'},
              {'label': 'bird', 'id': 'OT2'}],
             2: [{'label': 'table', 'id': '3Ba'},
              {'label': 'tab', 'id': 'd6a'},
              {'label': 'mop', 'id': 'd9o'}],
             3: [{'label': 'mom', 'id': 'tot'},
              {'label': 'dad', 'id': 'p5t'}]})
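
As a side note, the same per-user structure can likely be built without iterating row by row. The following is only a sketch under the assumption that the data matches the df1 shown above; it swaps the iterrows loop for groupby plus to_dict("records"), which is an alternative technique and not what the code above does, and it yields a plain dict rather than a defaultdict.

```python
import pandas as pd

# Same sample data as in the question.
d = {'user': [1, 1, 1, 2, 2, 2, 3, 3],
     'id': ['C1t', 'S4g', 'OT2', '3Ba', 'd6a', 'd9o', 'tot', 'p5t'],
     'label': ['dog', 'cat', 'bird', 'table', 'tab', 'mop', 'mom', 'dad']}
df1 = pd.DataFrame(d)

# One entry per user; each value is a list of {'label': ..., 'id': ...} dicts,
# in the same row order as the original frame.
df_to_dict = {
    user: group[['label', 'id']].to_dict('records')
    for user, group in df1.groupby('user')
}
```

Whether this is actually faster than iterrows on real data would need to be measured.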

Now my aim is to check the string similarity for each pair of 'label' values, but only within the same user (for instance, for user 1: dog-cat, cat-bird, and dog-bird), and to generate a string similarity index from this function:

def split(word):
    return [char for char in word]

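# Jaccard index of the two strings' character sets: |intersection| / |union|.
# Despite the name, this returns a similarity (1.0 = identical sets), not a distance.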
def DistJaccard(str1, str2):
    l1 = set(split(str1))
    l2 = set(split(str2))
    res = float(len(l1 & l2)) / len(l1 | l2)
    return res
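
To make the scoring concrete, here is a small worked check of the same computation. It is a sketch only: jaccard_similarity is a hypothetical stand-in name for the function above, and returning 0.0 for two empty strings is my own assumption (the original would raise ZeroDivisionError in that case).

```python
def jaccard_similarity(str1, str2):
    """Jaccard index over the sets of characters in each string."""
    chars1, chars2 = set(str1), set(str2)
    union = chars1 | chars2
    if not union:
        # Assumption: two empty strings score 0.0 instead of dividing by zero.
        return 0.0
    return len(chars1 & chars2) / len(union)

# 'dog' and 'door' share {d, o} out of {d, o, g, r}
score_dog_door = jaccard_similarity('dog', 'door')    # 2 / 4
# 'table' and 'tab' share {t, a, b} out of {t, a, b, l, e}
score_table_tab = jaccard_similarity('table', 'tab')  # 3 / 5
# 'dog' and 'cat' share no characters
score_dog_cat = jaccard_similarity('dog', 'cat')      # 0 / 6
```

Because the comparison is over character sets, repeated letters and letter order do not affect the score.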

My desired result is a dataframe with the columns 'user', the 'id' of each of the two records that were compared, both labels, and the string similarity score. Something like this:

   user  id1  id2  label1  label2  similarity score
0     1  C1t  S4g  dog     cat     0.0
1     1  OT2  C1t  bird    dog     0.17
2     1  OT2  S4g  bird    cat     0.0
3     2  3Ba  d6a  table   tab     0.6
4     2  d9o  3Ba  mop     table   0.0
5     2  d9o  d6a  mop     tab     0.0

# and so on for user 3

So my problem is how to apply the string similarity function to each pair of labels per user and then export all of the results to a dataframe. Any idea how to approach this? I could do it with pandas operations alone, but I have to do it in a dictionary first as it's way faster. Thank you!

CodePudding user response:

def create_similarity_df(df_to_dict):
    # Collect one dict per compared pair and build the dataframe once at the end.
    # DataFrame.append was removed in pandas 2.0 and was slow when called in a loop.
    rows = []
    for user, records in df_to_dict.items():
        for i in range(len(records)):
            # Start at i + 1 so each unordered pair is compared exactly once.
            for j in range(i + 1, len(records)):
                rows.append(
                    {'user': user,
                     'id1': records[i]['id'],
                     'id2': records[j]['id'],
                     'label1': records[i]['label'],
                     'label2': records[j]['label'],
                     'similarity': DistJaccard(records[i]['label'], records[j]['label'])})
    return pd.DataFrame(rows)

Calling create_similarity_df(df_to_dict) on the data from the question produces

   user  id1  id2 label1 label2  similarity
0     1  C1t  S4g    dog    cat    0.000000
1     1  C1t  OT2    dog   bird    0.166667
2     1  S4g  OT2    cat   bird    0.000000
3     2  3Ba  d6a  table    tab    0.600000
4     2  3Ba  d9o  table    mop    0.000000
5     2  d6a  d9o    tab    mop    0.000000
6     3  tot  p5t    mom    dad    0.000000
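
The same per-user pairing can also be written with itertools.combinations, which yields every unordered pair from a list. This is a hedged sketch, not a benchmarked replacement: pairwise_similarity and jaccard are hypothetical names introduced here, jaccard is assumed to behave like DistJaccard from the question, and the dictionary is typed in by hand to match the question's data.

```python
from itertools import combinations

import pandas as pd

def jaccard(a, b):
    """Jaccard index over the character sets of two non-empty strings."""
    s1, s2 = set(a), set(b)
    return len(s1 & s2) / len(s1 | s2)

def pairwise_similarity(df_to_dict):
    # combinations(records, 2) yields each unordered pair once, in list order.
    rows = [
        {'user': user,
         'id1': r1['id'], 'id2': r2['id'],
         'label1': r1['label'], 'label2': r2['label'],
         'similarity': jaccard(r1['label'], r2['label'])}
        for user, records in df_to_dict.items()
        for r1, r2 in combinations(records, 2)
    ]
    return pd.DataFrame(rows)

# Hand-written copy of the per-user dictionary from the question.
df_to_dict = {
    1: [{'label': 'dog', 'id': 'C1t'},
        {'label': 'cat', 'id': 'S4g'},
        {'label': 'bird', 'id': 'OT2'}],
    2: [{'label': 'table', 'id': '3Ba'},
        {'label': 'tab', 'id': 'd6a'},
        {'label': 'mop', 'id': 'd9o'}],
    3: [{'label': 'mom', 'id': 'tot'},
        {'label': 'dad', 'id': 'p5t'}],
}

result = pairwise_similarity(df_to_dict)
```

A user with n records contributes n * (n - 1) / 2 rows, so the result grows quadratically with the largest group.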