Second largest value is showing wrong value on pandas dataframe, when groupby-CodePudding

I have a table, and I'm trying to get the second largest "percent" value by column "Day". I can get the second largest value, but the column 'Hour', is not the right one

Table:df

name	Day	Hour	percent
000_RJ_S1	26	10	0.908494
000_RJ_S1	26	11	0.831482
000_RJ_S1	26	12	0.843846
000_RJ_S1	26	13	0.877238
000_RJ_S1	26	17	0.163908
000_RJ_S1	26	18	0.230296
000_RJ_S1	26	19	0.359440
000_RJ_S1	26	20	0.379988

Script Used: df = df.groupby(['name','Day'])[['Hour','percent']].apply(lambda x: x.nlargest(2, columns='percent').min())

Output: As you can see, the "Hour" value is wrong. It should be "13" and not "10". The second largest value is right.

name	Day	Hour	percent
000_RJ_S1	26	10	0.877238

It should be:

name	Day	Hour	percent
000_RJ_S1	26	13	0.877238

I can't figure out what's is wrong. Could you guys help me with this issue.

Thanks a lot

CodePudding user response：

One solution is to use a double groupby:

cols = ['name','Day']
# get the top 2 per group
s = df.groupby(cols)['percent'].nlargest(2)
# get the index of min per group
idx = s.droplevel(cols).groupby(s.droplevel(-1).index).idxmin()
# slice original data with those indexes
df2 = df.loc[idx.values]

Output:

        name  Day  Hour   percent
3  000_RJ_S1   26    13  0.877238

CodePudding user response：

The reason you have got 10 is because of the min() function.

The nlargest() in the lambda would return the two rows with largest percent values and when you apply min() what it does is it selects the minimum values from each column separately which gave you that output.

You can use iloc[1] instead of min() to get the desired result

Here's the code using iloc:

df.groupby(['name','Day'])[['Hour','percent']].apply(lambda x: x.nlargest(2, columns='percent')).iloc[1]

CodePudding user response：

Sort the percent columns before grouping, and use the nth function instead:

(df.sort_values('percent', ascending=False)
  .groupby(['name', 'Day'],sort=False, as_index = False)
  .nth(1)
) 
        name  Day  Hour   percent
3  000_RJ_S1   26    13  0.877238