I have a dataframe and would like to aggregate based on if some values in column Result
exist or not. So if for any ìndex1
and ìndex2
there is an A
then my total_result
column should be A
. If there is no A
but there is a B
then my total_result
column should be B
, etc.
Below I have some working code but I was wondering if there is no easier way to achieve the same?
df = pd.DataFrame({'index1':[1,2,2,2,3,3,4], 'index2':[1,2,2,2,3,3,4], 'Result':['A','','B','As','C','B','D']})
df_out = (df.groupby(['index1', 'index2'])
.agg(A=('Result', lambda x: (x == 'A').any() | (x == 'As').any()),
B=('Result', lambda x: (x == 'B').any()),
C=('Result', lambda x: (x == 'C').any()),
D=('Result', lambda x: (x == 'D').any()))
.reset_index()
.assign(tot_result = lambda x: np.where(x['A'], 'A',
np.where(x['B'], 'B',
np.where(x['C'], 'C', 'D'))))
.drop(columns=['A', 'B', 'C', 'D'])
)
CodePudding user response:
Solution if always match at least one value - First create helper columns and then aggregate DataFrameGroupBy.any
, for first match column name use DataFrame.idxmax
:
df1 = (df.assign(A = (df.Result == 'A') | (df.Result == 'As'),
B = (df.Result == 'B'),
C = (df.Result == 'C'),
D = (df.Result == 'D'))
.groupby(['index1', 'index2'])[['A','B','C','D']]
.any()
.idxmax(axis=1)
.reset_index(name='tot_result'))
print (df1)
If possible no match:
df = pd.DataFrame({'index1':[1,2,2,2,3,3,4],
'index2':[1,2,2,2,3,3,4],
'Result':['A','','B','As','C','B','E']})
df1 = (df.assign(A = (df.Result == 'A') | (df.Result == 'As'),
B = (df.Result == 'B'),
C = (df.Result == 'C'),
D = (df.Result == 'D'))
.groupby(['index1', 'index2'])[['A','B','C','D']]
.any()
.replace({True:1, False:None})
.apply(lambda x: x.first_valid_index(), axis=1)
.reset_index(name='tot_result'))
print (df1)
index1 index2 tot_result
0 1 1 A
1 2 2 A
2 3 3 B
3 4 4 None