I've a dataframe
a | b | c |
---|---|---|
one | 6 | 11 |
one | 7 | 12 |
two | 8 | 23 |
two | 9 | 14 |
three | 10 | 15 |
three | 20 | 25 |
I want to apply groupby at column a
and then find the highest value in column c
, so that, the highest value gets flagged, i.e.
a | b | c |
---|---|---|
one | 6 | 11 |
one | 7 | 12 |
Compare value 11&12, then
a | b | c |
---|---|---|
two | 8 | 23 |
two | 9 | 14 |
Compare value 23&14, then
a | b | c |
---|---|---|
three | 10 | 15 |
three | 20 | 25 |
Finally resulting in:
a | b | c | flag |
---|---|---|---|
one | 6 | 11 | no |
one | 7 | 12 | yes |
two | 8 | 23 | yes |
two | 9 | 14 | no |
three | 10 | 15 | no |
three | 20 | 25 | yes |
I/P DF:
df = pd.DataFrame({
'a':["one","one","two","two","three","three"]
, 'b':[6,7,8,9,10,20]
, 'c':[11,12,23,14,15,25]
# , 'flag': ['no', 'yes', 'yes', 'no', 'no', 'yes']
})
df
CodePudding user response:
You can use groupby.transform
to get the max value per group, and numpy.where
to map the True
/False
to 'yes'
/'no'
:
df['flag'] = np.where(df.groupby('a')['c'].transform('max').eq(df['c']), 'yes', 'no')
output:
a b c flag
0 one 6 11 no
1 one 7 12 yes
2 two 8 23 yes
3 two 9 14 no
4 three 10 15 no
5 three 20 25 yes
Intermediates:
df.groupby('a')['c'].transform('max')
0 12
1 12
2 23
3 23
4 25
5 25
Name: c, dtype: int64
df.groupby('a')['c'].transform('max').eq(df['c'])
0 False
1 True
2 True
3 False
4 False
5 True
Name: c, dtype: bool
CodePudding user response:
Use GroupBy.transform
with max
, comapre to same column c
and then set yes/no
in numpy.where
:
df['flag'] = np.where(df.c.eq(df.groupby('a')['c'].transform('max')), 'yes', 'no')
print(df)
a b c flag
0 one 6 11 no
1 one 7 12 yes
2 two 8 23 yes
3 two 9 14 no
4 three 10 15 no
5 three 20 25 yes
If multiple values per a
with maximal values get multiple yes
, if need only first maximal values use DataFrameGroupBy.idxmax
and compare df.index
:
df = pd.DataFrame({
'a':["one","one","one","two","three","three"]
, 'b':[6,7,8,9,10,20]
, 'c':[11,12,12,14,15,25]
})
df['flag1'] = np.where(df.c.eq(df.groupby('a')['c'].transform('max')), 'yes', 'no')
df['flag2'] = np.where(df.index == df.groupby('a')['c'].transform('idxmax'), 'yes', 'no')
print(df)
a b c flag1 flag2
0 one 6 11 no no
1 one 7 12 yes yes
2 one 8 12 yes no <- difference for match all max or first max
3 two 9 14 yes yes
4 three 10 15 no no
5 three 20 25 yes yes
CodePudding user response:
One way to do that is as follows
df['flag'] = df.apply(lambda x: 'yes' if x['c'] in df.groupby('a')['c'].max().values and x['a'] == df.groupby('c')['a'].max().loc[x['c']] else 'no', axis=1)
a b c flag
0 one 6 11 no
1 one 7 12 yes
2 two 8 23 yes
3 two 9 14 no
4 three 10 15 no
5 three 20 25 yes
Breaking down the various steps that one is doing above
df['flag']
creates the new column namedflag
.df.groupby('a')['c'].max()
will group by columna
, withpandas.DataFrame.groupby
, and find the highest value in columnc
.df2 = df.groupby('a')['c'].max()
Then we check if the value is in the dataframe generated in step 2 AND if the group is the same.
df['flag'] = df.apply(lambda x: 'yes' if x['c'] in df2.values and x['a'] == df2.loc[x['c']] else 'no', axis=1)
Notes:
Checking if the group is the same is key, else, even though it was working for this specific case, it wouldn't work if a group had a non-max value that was the max value of another group (as mozway mentioned).
As indicated in the answer that jezrael shared,
.apply
can be slow and, even though does the work, it might not be the most convenient way to do that.