Home > Back-end >  aggregate based on if column values exist
aggregate based on if column values exist

Time:11-08

I have a dataframe and would like to aggregate based on if some values in column Result exist or not. So if for any ìndex1 and ìndex2 there is an A then my total_result column should be A. If there is no A but there is a B then my total_result column should be B, etc. Below I have some working code but I was wondering if there is no easier way to achieve the same?

df = pd.DataFrame({'index1':[1,2,2,2,3,3,4], 'index2':[1,2,2,2,3,3,4], 'Result':['A','','B','As','C','B','D']})

df_out = (df.groupby(['index1', 'index2'])
        .agg(A=('Result', lambda x: (x == 'A').any() | (x == 'As').any()),
            B=('Result', lambda x: (x == 'B').any()), 
            C=('Result', lambda x: (x == 'C').any()), 
            D=('Result', lambda x: (x == 'D').any()))
       .reset_index()
       .assign(tot_result = lambda x: np.where(x['A'], 'A',
                                                   np.where(x['B'], 'B',
                                                            np.where(x['C'], 'C', 'D'))))

       .drop(columns=['A', 'B', 'C', 'D'])
       )

CodePudding user response:

Solution if always match at least one value - First create helper columns and then aggregate DataFrameGroupBy.any, for first match column name use DataFrame.idxmax:

df1 = (df.assign(A = (df.Result == 'A') | (df.Result == 'As'),
                B = (df.Result == 'B'),
                C = (df.Result == 'C'),
                D = (df.Result == 'D'))
        .groupby(['index1', 'index2'])[['A','B','C','D']]
        .any()
        .idxmax(axis=1)
        .reset_index(name='tot_result'))
print (df1)

If possible no match:

df = pd.DataFrame({'index1':[1,2,2,2,3,3,4], 
                   'index2':[1,2,2,2,3,3,4], 
                   'Result':['A','','B','As','C','B','E']})


df1 = (df.assign(A = (df.Result == 'A') | (df.Result == 'As'),
                B = (df.Result == 'B'),
                C = (df.Result == 'C'),
                D = (df.Result == 'D'))
        .groupby(['index1', 'index2'])[['A','B','C','D']]
        .any()
        .replace({True:1, False:None})
        .apply(lambda x: x.first_valid_index(), axis=1)
        .reset_index(name='tot_result'))
print (df1)
   index1  index2 tot_result
0       1       1          A
1       2       2          A
2       3       3          B
3       4       4       None
  • Related