Pandas Groupby and Compare rows to find maximum value

Time:09-20

I have a dataframe:

a b c
one 6 11
one 7 12
two 8 23
two 9 14
three 10 15
three 20 25

I want to group by column a and then find the highest value in column c within each group, so that the row holding the highest value gets flagged, i.e.

a b c
one 6 11
one 7 12

Compare the values 11 and 12, then

a b c
two 8 23
two 9 14

Compare the values 23 and 14, then

a b c
three 10 15
three 20 25

Finally resulting in:

a b c flag
one 6 11 no
one 7 12 yes
two 8 23 yes
two 9 14 no
three 10 15 no
three 20 25 yes

Input DataFrame:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': ["one", "one", "two", "two", "three", "three"],
    'b': [6, 7, 8, 9, 10, 20],
    'c': [11, 12, 23, 14, 15, 25],
    # 'flag': ['no', 'yes', 'yes', 'no', 'no', 'yes']  # desired output
})
df

CodePudding user response:

You can use groupby.transform to get the max value per group, and numpy.where to map the True/False to 'yes'/'no':

df['flag'] = np.where(df.groupby('a')['c'].transform('max').eq(df['c']), 'yes', 'no')

output:

       a   b   c flag
0    one   6  11   no
1    one   7  12  yes
2    two   8  23  yes
3    two   9  14   no
4  three  10  15   no
5  three  20  25  yes

Intermediates:

df.groupby('a')['c'].transform('max')

0    12
1    12
2    23
3    23
4    25
5    25
Name: c, dtype: int64

df.groupby('a')['c'].transform('max').eq(df['c'])
0    False
1     True
2     True
3    False
4    False
5     True
Name: c, dtype: bool
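
To see why transform is used here rather than a plain aggregation, the following minimal sketch (reusing the question's data) contrasts the two. The aggregation returns one value per group, indexed by the group key, while transform repeats the group maximum for every row on the original index, so it can be compared element-wise with column c. The variable names group_max, row_max, is_max and flag are only illustrative, and Series.map is shown as a numpy-free alternative to np.where.

```python
import pandas as pd

df = pd.DataFrame({
    'a': ["one", "one", "two", "two", "three", "three"],
    'b': [6, 7, 8, 9, 10, 20],
    'c': [11, 12, 23, 14, 15, 25],
})

# Aggregation: one value per group, indexed by the group key.
group_max = df.groupby('a')['c'].max()

# Transform: the group maximum repeated for every row, aligned on df's index.
row_max = df.groupby('a')['c'].transform('max')

# Because row_max shares df's index, an element-wise comparison is valid.
is_max = df['c'].eq(row_max)

# Map the booleans to labels without numpy.
flag = is_max.map({True: 'yes', False: 'no'})
```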

CodePudding user response:

Use GroupBy.transform with max, compare the result to the same column c, and then set yes/no with numpy.where:

df['flag'] = np.where(df.c.eq(df.groupby('a')['c'].transform('max')), 'yes', 'no')

print(df)
       a   b   c flag
0    one   6  11   no
1    one   7  12  yes
2    two   8  23  yes
3    two   9  14   no
4  three  10  15   no
5  three  20  25  yes

If several rows in the same group of a share the maximal value, each of them gets yes. If only the first maximal row should be flagged, use DataFrameGroupBy.idxmax and compare with df.index:

df = pd.DataFrame({
    'a':["one","one","one","two","three","three"]
    , 'b':[6,7,8,9,10,20]
    , 'c':[11,12,12,14,15,25]
})

df['flag1'] = np.where(df.c.eq(df.groupby('a')['c'].transform('max')), 'yes', 'no')
df['flag2'] = np.where(df.index == df.groupby('a')['c'].transform('idxmax'), 'yes', 'no')

print(df)

       a   b   c flag1 flag2
0    one   6  11    no    no
1    one   7  12   yes   yes
2    one   8  12   yes    no <- flag1 marks every max, flag2 only the first max
3    two   9  14   yes   yes
4  three  10  15    no    no
5  three  20  25   yes   yes
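
As an alternative sketch for the "first maximum only" case, idxmax can also be called directly on the grouped column, which returns one index label per group, and the row index can then be tested for membership in those labels. This assumes the index labels of df are unique. The name first_max_idx is illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': ["one", "one", "one", "two", "three", "three"],
    'b': [6, 7, 8, 9, 10, 20],
    'c': [11, 12, 12, 14, 15, 25],
})

# Index label of the first maximal row in each group (one label per group).
first_max_idx = df.groupby('a')['c'].idxmax()

# Flag only the rows whose index label is one of those labels.
df['flag2'] = np.where(df.index.isin(first_max_idx), 'yes', 'no')
```

The result should match the flag2 column above: only one row per group is flagged, even when the maximum is tied.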

CodePudding user response:

One way to do that is as follows:

df['flag'] = df.apply(lambda x: 'yes' if x['c'] == df.groupby('a')['c'].max().loc[x['a']] else 'no', axis=1)

       a   b   c flag
0    one   6  11   no
1    one   7  12  yes
2    two   8  23  yes
3    two   9  14   no
4  three  10  15   no
5  three  20  25  yes

Breaking down the steps above:

  • df['flag'] creates the new column named flag.

  • df.groupby('a')['c'].max() groups by column a, using pandas.DataFrame.groupby, and finds the highest value in column c for each group. The result is a Series indexed by the values of a.

     df2 = df.groupby('a')['c'].max()
    
  • Then we check, for each row, whether its value in column c equals the maximum of its own group, looked up in df2 by the row's value of a.

     df['flag'] = df.apply(lambda x: 'yes' if x['c'] == df2.loc[x['a']] else 'no', axis=1)
    

Notes:

  • Comparing each row against the maximum of its own group is key. A check that only asks whether the value appears among the group maxima works for this specific case, but it would fail if a group had a non-max value equal to the max value of another group (as mozway mentioned).

  • As indicated in the answer that jezrael shared, .apply can be slow; even though it does the job, it might not be the most convenient way to do this.

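
The note about checking the group can be illustrated with a small sketch. The data below is hypothetical (it is not from the question): group "one" contains the value 14, which is not its own maximum but equals the maximum of group "two". The column name naive is illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 14 is a non-max value in group "one"
# but is the maximum of group "two".
df = pd.DataFrame({
    'a': ["one", "one", "two", "two"],
    'c': [14, 20, 14, 9],
})

group_max = df.groupby('a')['c'].max()

# Naive check: is the value among the group maxima at all? Ignores the group.
df['naive'] = np.where(df['c'].isin(group_max), 'yes', 'no')

# Group-aware check: compare each row with the maximum of its own group.
df['flag'] = np.where(df['c'].eq(df.groupby('a')['c'].transform('max')), 'yes', 'no')
```

With this data the naive column flags the first row (one, 14) even though the maximum of group "one" is 20, while the group-aware flag column does not.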