Pandas groupby nlargest slice-CodePudding

There were kind of similar named questions, but they do not reflect the use case I am facing. I have a dataframe with groups and values. I want to select values sliced by their order (confusing maybe, example will explain better).

This is my data:

group   value
a   20
a   16
a   14
a   13
a   12
b   19
b   17
b   16
b   14
b   13
b   12
b   12
b   11

I want to group by group and slice [a:b] with nlargest logic, in other words, if a = 2 and b = 7 the biggest 3rd, 4th, 5ht, 6th and 7th variables per each group. I could not find any question here on this use case, or could I find something in pandas-dev github.

If there are less than b elements in any of the groups, then b = len(of that group) should be applied. If there are two or more elements with the same value, they should all be selected if they are within the [a:b] slice.

My desired result looks like this:

group   value
a   14
a   13
a   12
b   16
b   14
b   13
b   12
b   12

Here, the group a has 5 elements which is less than b in the example and because of that, 3rd to the 5th biggest elements are returned. In group b 6th and 7th biggest values are the same, so they are both returned.

The closest question to mine is this question about slice but it does not use nlargest logic. It just slices the groups.

If you could guide me on that, I would appreciate!

CodePudding user response：

You could try the following:

import pandas as pd

gbg = df.groupby('group')

a=2
b=7
res = gbg['value'].agg(lambda x: pd.Series.to_list(x)[a:b]).to_frame().explode('value').reset_index()
# .agg will "aggregate" the groups, here it will create the slices by group
# .to_frame will convert results from pd.Series to pd.DataFrame
# .explode() will write the list values in rows again
# .reset_index() will restore the column 'group'

The intermediate result after .agg():

group
a            [14, 13, 12]
b    [16, 14, 13, 12, 12]
Name: value, dtype: object

And the full result:

  group value
0     a    14
1     a    13
2     a    12
3     b    16
4     b    14
5     b    13
6     b    12
7     b    12

CodePudding user response：

By sorting the dataframe first and using the slice method which this approach gives me the result I expected.

df.sort_values(["group", "value"], ascending = False).groupby("group").slice(2, 7)

Output is

group   value
a   14
a   13
a   12
b   16
b   14
b   13
b   12
b   12