I have three pandas dataframes, say group_1, group_2 and group_3:
import pandas as pd
group_1 = pd.DataFrame({'A':[1,0,1,1,1], 'B':[1,1,1,1,1]})
group_2 = pd.DataFrame({'A':[1,1,1,1,1], 'B':[1,1,0,0,0]})
group_3 = pd.DataFrame({'A':[1,1,1,1,1], 'B':[0,0,0,0,0]})
These are filled with dummy values; all values in the groups above are binary.
Now there is another dataframe, a new one:
new_data_frame = pd.DataFrame({'A':[1,1,1,1,1], 'B':[0,0,0,0,0],'mobile': ['xxxxx','yyyyy','zzzzz','wwwww','mmmmmm']})
new_data_frame.set_index('mobile')
        A  B
mobile
xxxxx   1  0
yyyyy   1  0
zzzzz   1  0
wwwww   1  0
mmmmmm  1  0
For each mobile in new_data_frame, I want to calculate the mean cosine similarity against each group (sum the similarity scores against every row of the group and divide by the number of rows in that group). Each mobile number should then be assigned to the group with the highest mean score.
So my expected output will be
mobile group
xxxxx group_1
yyyyy group_1
zzzzz group_3
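Something like this:
for x in new_data_frame[['A', 'B']].to_numpy():
    score = []
    for y in group_1.to_numpy():
        # cosine_similarity expects 2-D inputs, so wrap each row in a list
        a = cosine_similarity([x], [y])[0, 0]
        score.append(a)
    # divide by the number of rows in the group, not by len(y)
    mean_score = sum(score) / len(group_1)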
I have added the code below. Is there a better way to achieve this?
from sklearn.metrics.pairwise import cosine_similarity

def max_group(x, group_1, group_2, group_3):
    x_ = x.tolist()
    val = x_[:-1]  # drop the trailing 'mobile' value, keep A and B
    group = [group_1, group_2, group_3]
    score = []
    for i in range(len(group)):
        a = cosine_similarity([val], group[i].to_numpy())
        print('<---->')
        print(a.mean())
        score.append((a.mean(), i))
    # max(score) is the tuple with the highest mean; [1] is its group index
    return max(score)[1]

new_data_frame['group'] = new_data_frame.apply(lambda x: max_group(x, group_1, group_2, group_3), axis=1)
CodePudding user response:
Solution
Create a mapping of group names to dataframes, then compute the mean cosine similarity against each group inside a dict comprehension. Build a new dataframe from the computed scores and use idxmax to find the name of the group with the highest mean similarity score.
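from sklearn.metrics.pairwise import cosine_similarity
# assumes new_data_frame is indexed by 'mobile' and holds only the numeric columns A and B
grps = {'group_1': group_1, 'group_2': group_2, 'group_3': group_3}
# one array of per-row mean similarities for each group
scores = {k: cosine_similarity(new_data_frame, g).mean(1) for k, g in grps.items()}
# for each mobile, pick the group whose column holds the highest score
pd.DataFrame(scores, index=new_data_frame.index).idxmax(1)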
Result
mobile
xxxxx group_3
yyyyy group_3
zzzzz group_3
wwwww group_3
mmmmmm group_3
dtype: object
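To see why every mobile ends up in group_3 with this sample data, the mean scores can be checked by hand. The following is only a rough sketch in plain Python, without sklearn; the helper names `cosine` and `mean_cosine` are made up for illustration, and returning 0.0 for an all-zero vector is an assumption of this sketch. Every row of new_data_frame is (1, 0), so one row is enough.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean_cosine(row, group_rows):
    """Sum the similarities against every row of the group, divide by the row count."""
    return sum(cosine(row, g) for g in group_rows) / len(group_rows)

# Same sample data as in the question, written as lists of (A, B) rows.
groups = {
    'group_1': list(zip([1, 0, 1, 1, 1], [1, 1, 1, 1, 1])),
    'group_2': list(zip([1, 1, 1, 1, 1], [1, 1, 0, 0, 0])),
    'group_3': list(zip([1, 1, 1, 1, 1], [0, 0, 0, 0, 0])),
}

row = (1, 0)  # A=1, B=0, identical for every mobile in new_data_frame
scores = {name: mean_cosine(row, rows) for name, rows in groups.items()}
best = max(scores, key=scores.get)

print({name: round(s, 3) for name, s in scores.items()})
print(best)
```

Every row of group_3 is identical to the query row, so its mean similarity is 1.0, which is the largest value a cosine similarity can take. That is why idxmax returns group_3 for all five mobiles.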