If the dataframe looks like this:
Class1 Class2 Exam Score1 Score2
a1 a2 a 0.90 0.70
a1 a2 b 0.80 0.70
a1 a2 c 0.10 0.70
a1 a3 c 0.10 0.20
a2 a3 c 0.70 0.20
a2 a3 d 0.30 0.10
a3 a4 e 1.00 0.80
a3 a4 f 0.00 0.70
a3 a5 e 1.00 0.90
a3 a5 f 0.00 0.90
a4 a5 e 0.80 0.90
a4 a5 f 0.70 0.90
I want to compute the mean for each class, but per pair (a1-a2, a1-a3 etc., just like in the dataframe above). I did the most naive thing and split the dataframe into multiple ones and then computed the mean over the whole Scores columns. But this is impossible when the dataframe has hundreds of pairs. How could I group by pair?
The desired output:
Class1 Class2 Average1 Average2
a1 a2 0.60 0.70
a1 a3 0.10 0.20
a2 a3 0.50 0.15
a3 a4 0.50 0.75
a3 a5 0.50 0.90
a4 a5 0.75 0.90
CodePudding user response:
Simply groupby
both columns. Named Aggregations here can help to both aggregate and rename in the same step:
new_df = df.groupby(
['Class1', 'Class2'], as_index=False
).agg(Average1=('Score1', 'mean'), Average2=('Score2', 'mean'))
new_df
:
Class1 Class2 Average1 Average2
0 a1 a2 0.60 0.70
1 a1 a3 0.10 0.20
2 a2 a3 0.50 0.15
3 a3 a4 0.50 0.75
4 a3 a5 0.50 0.90
5 a4 a5 0.75 0.90
Setup and imports:
import pandas as pd
df = pd.DataFrame({
'Class1': ['a1', 'a1', 'a1', 'a1', 'a2', 'a2', 'a3', 'a3', 'a3', 'a3', 'a4',
'a4'],
'Class2': ['a2', 'a2', 'a2', 'a3', 'a3', 'a3', 'a4', 'a4', 'a5', 'a5', 'a5',
'a5'],
'Exam': ['a', 'b', 'c', 'c', 'c', 'd', 'e', 'f', 'e', 'f', 'e', 'f'],
'Score1': [0.9, 0.8, 0.1, 0.1, 0.7, 0.3, 1.0, 0.0, 1.0, 0.0, 0.8, 0.7],
'Score2': [0.7, 0.7, 0.7, 0.2, 0.2, 0.1, 0.8, 0.7, 0.9, 0.9, 0.9, 0.9]
})