I have a table, and I'm trying to get the second largest "percent" value by column "Day". I can get the second largest value, but the column 'Hour', is not the right one
Table:df
name | Day | Hour | percent |
---|---|---|---|
000_RJ_S1 | 26 | 10 | 0.908494 |
000_RJ_S1 | 26 | 11 | 0.831482 |
000_RJ_S1 | 26 | 12 | 0.843846 |
000_RJ_S1 | 26 | 13 | 0.877238 |
000_RJ_S1 | 26 | 17 | 0.163908 |
000_RJ_S1 | 26 | 18 | 0.230296 |
000_RJ_S1 | 26 | 19 | 0.359440 |
000_RJ_S1 | 26 | 20 | 0.379988 |
Script Used: df = df.groupby(['name','Day'])[['Hour','percent']].apply(lambda x: x.nlargest(2, columns='percent').min())
Output: As you can see, the "Hour" value is wrong. It should be "13" and not "10". The second largest value is right.
name | Day | Hour | percent |
---|---|---|---|
000_RJ_S1 | 26 | 10 | 0.877238 |
It should be:
name | Day | Hour | percent |
---|---|---|---|
000_RJ_S1 | 26 | 13 | 0.877238 |
I can't figure out what's is wrong. Could you guys help me with this issue.
Thanks a lot
CodePudding user response:
One solution is to use a double groupby
:
cols = ['name','Day']
# get the top 2 per group
s = df.groupby(cols)['percent'].nlargest(2)
# get the index of min per group
idx = s.droplevel(cols).groupby(s.droplevel(-1).index).idxmin()
# slice original data with those indexes
df2 = df.loc[idx.values]
Output:
name Day Hour percent
3 000_RJ_S1 26 13 0.877238
CodePudding user response:
The reason you have got 10 is because of the min()
function.
The nlargest()
in the lambda would return the two rows with largest percent values and when you apply min()
what it does is it selects the minimum values from each column separately which gave you that output.
You can use iloc[1]
instead of min()
to get the desired result
Here's the code using iloc:
df.groupby(['name','Day'])[['Hour','percent']].apply(lambda x: x.nlargest(2, columns='percent')).iloc[1]
CodePudding user response:
Sort the percent
columns before grouping, and use the nth
function instead:
(df.sort_values('percent', ascending=False)
.groupby(['name', 'Day'],sort=False, as_index = False)
.nth(1)
)
name Day Hour percent
3 000_RJ_S1 26 13 0.877238