Pandas Groupby and Compare rows to find maximum value-CodePudding

I've a dataframe

a	b	c
one	6	11
one	7	12
two	8	23
two	9	14
three	10	15
three	20	25

I want to apply groupby at column a and then find the highest value in column c, so that, the highest value gets flagged, i.e.

a	b	c
one	6	11
one	7	12

Compare value 11&12, then

a	b	c
two	8	23
two	9	14

Compare value 23&14, then

a	b	c
three	10	15
three	20	25

Finally resulting in:

a	b	c	flag
one	6	11	no
one	7	12	yes
two	8	23	yes
two	9	14	no
three	10	15	no
three	20	25	yes

I/P DF:

df = pd.DataFrame({
    'a':["one","one","two","two","three","three"]
    , 'b':[6,7,8,9,10,20]
    , 'c':[11,12,23,14,15,25]
    # , 'flag': ['no', 'yes', 'yes', 'no', 'no', 'yes']
})
df

CodePudding user response：

You can use groupby.transform to get the max value per group, and numpy.where to map the True/False to 'yes'/'no':

df['flag'] = np.where(df.groupby('a')['c'].transform('max').eq(df['c']), 'yes', 'no')

output:

       a   b   c flag
0    one   6  11   no
1    one   7  12  yes
2    two   8  23  yes
3    two   9  14   no
4  three  10  15   no
5  three  20  25  yes

Intermediates:

df.groupby('a')['c'].transform('max')

0    12
1    12
2    23
3    23
4    25
5    25
Name: c, dtype: int64

df.groupby('a')['c'].transform('max').eq(df['c'])
0    False
1     True
2     True
3    False
4    False
5     True
Name: c, dtype: bool

CodePudding user response：

Use GroupBy.transform with max, comapre to same column c and then set yes/no in numpy.where:

df['flag'] = np.where(df.c.eq(df.groupby('a')['c'].transform('max')), 'yes', 'no')

print(df)
       a   b   c flag
0    one   6  11   no
1    one   7  12  yes
2    two   8  23  yes
3    two   9  14   no
4  three  10  15   no
5  three  20  25  yes

If multiple values per a with maximal values get multiple yes, if need only first maximal values use DataFrameGroupBy.idxmax and compare df.index:

df = pd.DataFrame({
    'a':["one","one","one","two","three","three"]
    , 'b':[6,7,8,9,10,20]
    , 'c':[11,12,12,14,15,25]
})

df['flag1'] = np.where(df.c.eq(df.groupby('a')['c'].transform('max')), 'yes', 'no')
df['flag2'] = np.where(df.index == df.groupby('a')['c'].transform('idxmax'), 'yes', 'no')

print(df)

       a   b   c flag1 flag2
0    one   6  11    no    no
1    one   7  12   yes   yes
2    one   8  12   yes    no <- difference for match all max or first max
3    two   9  14   yes   yes
4  three  10  15    no    no
5  three  20  25   yes   yes

CodePudding user response：

One way to do that is as follows

df['flag'] = df.apply(lambda x: 'yes' if x['c'] in df.groupby('a')['c'].max().values and x['a'] == df.groupby('c')['a'].max().loc[x['c']] else 'no', axis=1)

       a   b   c flag
0    one   6  11   no
1    one   7  12  yes
2    two   8  23  yes
3    two   9  14   no
4  three  10  15   no
5  three  20  25  yes

Breaking down the various steps that one is doing above

df['flag'] creates the new column named flag.
df.groupby('a')['c'].max() will group by column a, with pandas.DataFrame.groupby, and find the highest value in column c.
```
 df2 = df.groupby('a')['c'].max()
```

Then we check if the value is in the dataframe generated in step 2 AND if the group is the same.

 df['flag'] = df.apply(lambda x: 'yes' if x['c'] in df2.values and x['a'] == df2.loc[x['c']] else 'no', axis=1)

Notes:

Checking if the group is the same is key, else, even though it was working for this specific case, it wouldn't work if a group had a non-max value that was the max value of another group (as mozway mentioned).
As indicated in the answer that jezrael shared, .apply can be slow and, even though does the work, it might not be the most convenient way to do that.