In pandas, how I will first do group_by then sum then take the top two based on sum?
I tried many thing including the following one.
df.groupby(['A','B'])['C'].sum().groupby(['A']).nlargest(2,'C')
df.groupby(['A','B'])['C'].sum().groupby(['A']).apply(lambda x:nlargest(2))
I want to get the sum of C based on combination A and B, then filter out top two based on sum within group A.
Thank you.
Here is the data:
A B C
Alabama a 100
Alabama b 50
Alabama c 40
Alabama d 5
Alabama e 1
...
Wyoming a.51 180
Wyoming b.51 150
Wyoming c.51 56
Wyoming d.51 5
CodePudding user response:
Based on your sample data, you can try:
(df.groupby(['A', 'B'], as_index=False)['C'].sum()
.groupby('A')['C'].nlargest(2)
.droplevel(1)
)
Data Input:
A B C
0 Alabama a 100
1 Alabama b 50
2 Alabama c 40
3 Alabama d 5
4 Alabama e 1
5 Wyoming a.51 180
6 Wyoming b.51 150
7 Wyoming c.51 56
8 Wyoming d.51 5
Output:
A
Alabama 100
Alabama 50
Wyoming 180
Wyoming 150
Name: C, dtype: int64
Extended Test Cases
Let's try with more data to show the sums of the first groupby()
works and how it also works after grouped by A
again:
Data Input
A B C
0 Alabama a 100
1 Alabama b 50
2 Alabama b 250
3 Alabama c 40
4 Alabama d 5
5 Alabama d 355
6 Alabama e 1
7 Wyoming a.51 180
8 Wyoming b.51 150
9 Wyoming c.51 56
10 Wyoming c.51 556
11 Wyoming d.51 5
12 Wyoming d.51 820
Output
A
Alabama 360
Alabama 300
Wyoming 825
Wyoming 612
Name: C, dtype: int64
Edit
If you want to show all columns, you can use:
(df.groupby(['A','B'], as_index=False)['C'].sum()
.groupby(['A']).apply(lambda x: x.nlargest(2,'C'))
.reset_index(drop=True)
)
Data Input
A B C
0 Alabama a 100
1 Alabama b 50
2 Alabama b 250
3 Alabama c 40
4 Alabama d 5
5 Alabama d 355
6 Alabama e 1
7 Wyoming a.51 180
8 Wyoming b.51 150
9 Wyoming c.51 56
10 Wyoming c.51 556
11 Wyoming d.51 5
12 Wyoming d.51 820
Output
A B C
0 Alabama d 360
1 Alabama b 300
2 Wyoming d.51 825
3 Wyoming c.51 612
CodePudding user response:
You need to specify the column on which you're applying nlargest
after groupby
. Try:
>>> df.groupby(["A","B"]).sum().groupby("A")["C"].nlargest(2)
CodePudding user response:
I fixed this as follows, but still looking for better solutions.
df.groupby(['A','B'])['C'].sum().reset_index().groupby(['A']).apply(lambda x:nlargest(2,'C'))