My dataframe looks like this:
df=
name group feedback question
a g1 False abc
a g1 True abc
a g1 True xyz
b g1 True xyz
b g1 True abc
c g1 False def
d g2 False xyz
d g2 True xyz
e g2 True xyz
f g3 True abc
f g3 True www
g g3 False xyz
h g4 True www
h g4 True qqq
i g4 False xyz
I also have a dataframe that contains pairs of these groups (and some other columns that are unimportant). It usually, but NOT always contains all the possible pairs.
df_2=
Group1 Group2 Value
g1 g2 0.25
g1 g3 0.50
g2 g3 0.25
g3 g4 0.50
I want to build a new dataframe that has the following header:
Group1 Group2 question ScoreGroup1 ScoreGroup2
The question
will be the common questions each pair has answered. The ScoreGroup1
is the number of Trues/(Trues Falses)
answers for that question for Group1
, analogously for Group2
. So in the case above:
Group1 Group2 question ScoreGroup1 ScoreGroup2
g1 g2 xyz 1 0.66
g1 g3 abc 0.66 1
g1 g3 xyz 1 0
g2 g3 xyz 0.66 0
g3 g4 www 1 1
g3 g4 xyz 0 0
What I did so far: grouping by group in the original dataframe and feeback.mean()
, but that only gives me an overall ranking of the groups. I need to make use of the pairing and the questions that are common. I found which questions are common by doing:
mutuals = df.groupby('group')['question'].apply(set)
But how do I compute these pairwise scores?
EDIT: These are the dataframes:
df = {'name':['a', 'a', 'a', 'b', 'b', 'c', 'd', 'd', 'e', 'f', 'f', 'g', 'h', 'h', 'i'],
'group':['g1', 'g1', 'g1', 'g1', 'g1', 'g1', 'g2', 'g2', 'g2', 'g3', 'g3', 'g3', 'g4', 'g4', 'g4'],
'feedback': [False, True, True, True, True, False, False, True, True, True, True, False, True, True, False],
'question': ['abc', 'abc', 'xyz', 'xyz', 'abc', 'def', 'xyz', 'xyz', 'xyz', 'abc', 'www', 'xyz', 'www', 'qqq', 'xyz]}
df_2 = {'Group1': ['g1', 'g1', 'g2', 'g3'],
'Group2': ['g2', 'g3', 'g3', 'g4'],
'Value': [0.25, 0.50, 0.25, 0.50]}
CodePudding user response:
Use DataFrame.merge
by same DataFrame
with remove rows if same values first, then aggregate mean:
df = (df.merge(df, on='question', suffixes=('1','2'))
.query('group1 != group2')
.groupby(['group1','group2','question'], as_index=False)
.mean())
print (df)
group1 group2 question feedback1 feedback2
0 g1 g2 xyz 1.000000 0.666667
1 g1 g3 abc 0.666667 1.000000
2 g1 g3 xyz 1.000000 0.000000
3 g1 g4 xyz 1.000000 0.000000
4 g2 g1 xyz 0.666667 1.000000
5 g2 g3 xyz 0.666667 0.000000
6 g2 g4 xyz 0.666667 0.000000
7 g3 g1 abc 1.000000 0.666667
8 g3 g1 xyz 0.000000 1.000000
9 g3 g2 xyz 0.000000 0.666667
10 g3 g4 www 1.000000 1.000000
11 g3 g4 xyz 0.000000 0.000000
12 g4 g1 xyz 0.000000 1.000000
13 g4 g2 xyz 0.000000 0.666667
14 g4 g3 www 1.000000 1.000000
15 g4 g3 xyz 0.000000 0.000000
And last create same columns names for inner join by df_2
:
d = {'Group1':'group1','Group2':'group2'}
df = df.merge(df_2[['Group1','Group2']].rename(columns=d))
print (df)
group1 group2 question feedback1 feedback2
0 g1 g2 xyz 1.000000 0.666667
1 g1 g3 abc 0.666667 1.000000
2 g1 g3 xyz 1.000000 0.000000
3 g2 g3 xyz 0.666667 0.000000
4 g3 g4 www 1.000000 1.000000
5 g3 g4 xyz 0.000000 0.000000