Find cosine similarity between different pandas dataframes

Time:09-03

I have three pandas dataframes, say group_1, group_2 and group_3:

import pandas as pd

group_1 = pd.DataFrame({'A':[1,0,1,1,1], 'B':[1,1,1,1,1]})
group_2 = pd.DataFrame({'A':[1,1,1,1,1], 'B':[1,1,0,0,0]})
group_3 = pd.DataFrame({'A':[1,1,1,1,1], 'B':[0,0,0,0,0]})

These are filled with dummy values; all values in the groups above are binary.

Now there is another, new dataframe:

new_data_frame = pd.DataFrame({'A':[1,1,1,1,1], 'B':[0,0,0,0,0],'mobile': ['xxxxx','yyyyy','zzzzz','wwwww','mmmmmm']})
# set_index returns a new dataframe, so assign the result; the column is named 'mobile'
new_data_frame = new_data_frame.set_index('mobile')

         A  B
 mobile     
 xxxxx  1   0
 yyyyy  1   0
 zzzzz  1   0
 wwwww  1   0
 mmmmmm 1   0

For each mobile in new_data_frame, I want to calculate the mean cosine similarity against each group (sum all the scores and divide by the number of rows in the group dataframe). Each mobile number is then assigned to the group for which it has the highest score.

So my expected output will be

   mobile group
   xxxxx  group_1
   yyyyy  group_1
   zzzzz  group_3
   

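Something like this:

  from sklearn.metrics.pairwise import cosine_similarity

  for x in new_data_frame[['A', 'B']].to_numpy():
      score = []
      for y in group_1.to_numpy():
          # cosine_similarity expects 2D inputs, so wrap each row in a list
          a = cosine_similarity([x], [y])[0, 0]
          score.append(a)
      # mean similarity of this row against every row of group_1
      mean_score = sum(score) / len(score)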

I have added the code below. Is there a better way to achieve this?

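from sklearn.metrics.pairwise import cosine_similarity

def max_group(x, group_1, group_2, group_3):
    # keep only the feature columns shared with the groups (drops 'mobile' if present)
    val = x[group_1.columns].to_numpy(dtype=float)

    group = [group_1, group_2, group_3]

    score = []
    for i in range(len(group)):
        # similarity of this row against every row of the group, shape (1, n_rows)
        a = cosine_similarity([val], group[i].to_numpy())
        score.append((a.mean(), i))

    # position (0, 1 or 2) of the group with the highest mean similarity
    return max(score)[1]

new_data_frame['group'] = new_data_frame.apply(lambda x: max_group(x, group_1, group_2, group_3), axis=1)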


CodePudding user response:

Solution

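Create a mapping from group names to dataframes, then compute the mean cosine similarity for each group inside a dict comprehension. Build a new dataframe from the computed scores and use idxmax to find the name of the group with the highest mean similarity score. This assumes mobile has been set as the index of new_data_frame, so that only the numeric columns A and B are passed to cosine_similarity.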

from sklearn.metrics.pairwise import cosine_similarity

grps = {'group_1': group_1, 'group_2': group_2, 'group_3': group_3}
scores = {k: cosine_similarity(new_data_frame, g).mean(1) for k, g in grps.items()}

pd.DataFrame(scores, index=new_data_frame.index).idxmax(1)

Result

mobile
xxxxx     group_3
yyyyy     group_3
zzzzz     group_3
wwwww     group_3
mmmmmm    group_3
dtype: object
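
To see why every row lands in group_3, it can help to look at the full table of mean scores rather than only the winning label. The following is a minimal sketch that recomputes the same quantity with plain NumPy instead of scikit-learn. The helper names unit_rows and mean_cosine_scores are hypothetical and not part of the original answer, and treating all-zero rows as having zero similarity is an assumption made here to avoid dividing by zero.

```python
import numpy as np
import pandas as pd

group_1 = pd.DataFrame({'A': [1, 0, 1, 1, 1], 'B': [1, 1, 1, 1, 1]})
group_2 = pd.DataFrame({'A': [1, 1, 1, 1, 1], 'B': [1, 1, 0, 0, 0]})
group_3 = pd.DataFrame({'A': [1, 1, 1, 1, 1], 'B': [0, 0, 0, 0, 0]})
new_data_frame = pd.DataFrame(
    {'A': [1, 1, 1, 1, 1], 'B': [0, 0, 0, 0, 0]},
    index=pd.Index(['xxxxx', 'yyyyy', 'zzzzz', 'wwwww', 'mmmmmm'], name='mobile'),
)

def unit_rows(frame):
    """Scale each row to unit length; all-zero rows stay zero (assumption)."""
    values = frame.to_numpy(dtype=float)
    norms = np.linalg.norm(values, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid dividing by zero for all-zero rows
    return values / norms

def mean_cosine_scores(new_rows, groups):
    """Return one column of mean cosine similarity per group (hypothetical helper)."""
    left = unit_rows(new_rows)
    scores = {
        # dot products of unit vectors are cosine similarities;
        # averaging over axis 1 averages over the rows of the group
        name: (left @ unit_rows(group[new_rows.columns]).T).mean(axis=1)
        for name, group in groups.items()
    }
    return pd.DataFrame(scores, index=new_rows.index)

grps = {'group_1': group_1, 'group_2': group_2, 'group_3': group_3}
score_table = mean_cosine_scores(new_data_frame, grps)
assigned = score_table.idxmax(axis=1)

print(score_table.round(3))
print(assigned)
```

With the dummy data from the question, every new row is [1, 0], so each row scores about 0.566 against group_1, about 0.883 against group_2, and exactly 1.0 against group_3, which is why idxmax returns group_3 throughout.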