Home > Mobile >  Second largest value is showing wrong value on pandas dataframe, when groupby
Second largest value is showing wrong value on pandas dataframe, when groupby

Time:11-20

I have a table, and I'm trying to get the second largest "percent" value by column "Day". I can get the second largest value, but the column 'Hour', is not the right one

Table:df

name Day Hour percent
000_RJ_S1 26 10 0.908494
000_RJ_S1 26 11 0.831482
000_RJ_S1 26 12 0.843846
000_RJ_S1 26 13 0.877238
000_RJ_S1 26 17 0.163908
000_RJ_S1 26 18 0.230296
000_RJ_S1 26 19 0.359440
000_RJ_S1 26 20 0.379988

Script Used: df = df.groupby(['name','Day'])[['Hour','percent']].apply(lambda x: x.nlargest(2, columns='percent').min())

Output: As you can see, the "Hour" value is wrong. It should be "13" and not "10". The second largest value is right.

name Day Hour percent
000_RJ_S1 26 10 0.877238

It should be:

name Day Hour percent
000_RJ_S1 26 13 0.877238

I can't figure out what's is wrong. Could you guys help me with this issue.

Thanks a lot

CodePudding user response:

One solution is to use a double groupby:

cols = ['name','Day']
# get the top 2 per group
s = df.groupby(cols)['percent'].nlargest(2)
# get the index of min per group
idx = s.droplevel(cols).groupby(s.droplevel(-1).index).idxmin()
# slice original data with those indexes
df2 = df.loc[idx.values]

Output:

        name  Day  Hour   percent
3  000_RJ_S1   26    13  0.877238

CodePudding user response:

The reason you have got 10 is because of the min() function.

The nlargest() in the lambda would return the two rows with largest percent values and when you apply min() what it does is it selects the minimum values from each column separately which gave you that output.

You can use iloc[1] instead of min() to get the desired result

Here's the code using iloc:

df.groupby(['name','Day'])[['Hour','percent']].apply(lambda x: x.nlargest(2, columns='percent')).iloc[1]

CodePudding user response:

Sort the percent columns before grouping, and use the nth function instead:

(df.sort_values('percent', ascending=False)
  .groupby(['name', 'Day'],sort=False, as_index = False)
  .nth(1)
) 
        name  Day  Hour   percent
3  000_RJ_S1   26    13  0.877238
  • Related