Home > Net >  Pandas: groupby sum then nlargest
Pandas: groupby sum then nlargest

Time:09-16

In pandas, how I will first do group_by then sum then take the top two based on sum?

I tried many thing including the following one.

df.groupby(['A','B'])['C'].sum().groupby(['A']).nlargest(2,'C')
df.groupby(['A','B'])['C'].sum().groupby(['A']).apply(lambda x:nlargest(2))

I want to get the sum of C based on combination A and B, then filter out top two based on sum within group A.

Thank you.

Here is the data:

A        B         C 
Alabama  a         100
Alabama  b         50
Alabama  c         40
Alabama  d         5
Alabama  e         1
...
Wyoming  a.51      180
Wyoming  b.51      150
Wyoming  c.51      56
Wyoming  d.51      5

CodePudding user response:

Based on your sample data, you can try:

(df.groupby(['A', 'B'], as_index=False)['C'].sum()
   .groupby('A')['C'].nlargest(2)
   .droplevel(1)
)

Data Input:

         A     B    C
0  Alabama     a  100
1  Alabama     b   50
2  Alabama     c   40
3  Alabama     d    5
4  Alabama     e    1
5  Wyoming  a.51  180
6  Wyoming  b.51  150
7  Wyoming  c.51   56
8  Wyoming  d.51    5

Output:

A
Alabama    100
Alabama     50
Wyoming    180
Wyoming    150
Name: C, dtype: int64

Extended Test Cases

Let's try with more data to show the sums of the first groupby() works and how it also works after grouped by A again:

Data Input

          A     B    C
0   Alabama     a  100
1   Alabama     b   50
2   Alabama     b  250
3   Alabama     c   40
4   Alabama     d    5
5   Alabama     d  355
6   Alabama     e    1
7   Wyoming  a.51  180
8   Wyoming  b.51  150
9   Wyoming  c.51   56
10  Wyoming  c.51  556
11  Wyoming  d.51    5
12  Wyoming  d.51  820

Output

A
Alabama    360
Alabama    300
Wyoming    825
Wyoming    612
Name: C, dtype: int64

Edit

If you want to show all columns, you can use:

(df.groupby(['A','B'], as_index=False)['C'].sum()
   .groupby(['A']).apply(lambda x: x.nlargest(2,'C'))
   .reset_index(drop=True)
)

Data Input

          A     B    C
0   Alabama     a  100
1   Alabama     b   50
2   Alabama     b  250
3   Alabama     c   40
4   Alabama     d    5
5   Alabama     d  355
6   Alabama     e    1
7   Wyoming  a.51  180
8   Wyoming  b.51  150
9   Wyoming  c.51   56
10  Wyoming  c.51  556
11  Wyoming  d.51    5
12  Wyoming  d.51  820

Output

         A     B    C
0  Alabama     d  360
1  Alabama     b  300
2  Wyoming  d.51  825
3  Wyoming  c.51  612

CodePudding user response:

You need to specify the column on which you're applying nlargest after groupby. Try:

>>> df.groupby(["A","B"]).sum().groupby("A")["C"].nlargest(2)

CodePudding user response:

I fixed this as follows, but still looking for better solutions.

df.groupby(['A','B'])['C'].sum().reset_index().groupby(['A']).apply(lambda x:nlargest(2,'C'))
  • Related