Home > front end >  pandas groupby with many categories and sort them by value
pandas groupby with many categories and sort them by value

Time:06-08

I have the following data

country goodCode value
CN 27000400 12 000 000
KZ 15000000 23 000 000
... ... ...

overall there are over 150 different countries and over 12000 different goods (i.e. distinct goodCode)

How I can generate the following table :

country goodCode Totalvalue
CN 27000400 400 000 000
15000000 100 000 000
... ... ...
KZ 15000000 244 000 000
27000400 130 000 000

Basically I want to group entire dataset by country and then by goodCode and then display it as for each country top good (i.e. goodCode) and total sum of this good.

When i applied df.groupby(["country" ,"goodCode"]).sum().reset_index() i had the following table

country goodCode Totalvalue
CN 27000400 400 000 000
CN 15000000 100 000 000
CN 45000000 700 000 000
CN 65000000 1 700 000 000
... ... ...
KZ 15000000 244 000 000
KZ 27000400 130 000 000
KZ 37000400 530 000 000

It is 1. unsorted 2. Country name in each row.

When I applied sort_values(by=["Totalvalue"], ascending = False) The DataFrame messed up. Countries mixed but sorted by value. It was something like:

country goodCode Totalvalue
CN 27000400 1 700 000 000
KZ 15000000 700 000 000
AN 45000000 200 000 000
CN 65000000 100 000 000
... ... ...
CA 15000000 50 000 000
AE 27000400 25 000 000
KZ 37000400 20 000 000

Can anyone help?

CodePudding user response:

You can use:

out = (df.groupby(['country', 'goodCode'], as_index=False).sum()
         .sort_values(['country', 'Totalvalue'], ascending=[True, False]))

For example:

>>> df
  country  goodCode  Totalvalue
0      CN  27000400  1700000000
1      KZ  15000000   700000000
2      AN  45000000   200000000
3      CN  65000000   100000000
4      CA  15000000    50000000
5      AE  27000400    25000000
6      KZ  37000400    20000000

>>> df.sort_values(['country', 'Totalvalue'], ascending=[True, False])
  country  goodCode  Totalvalue
5      AE  27000400    25000000
2      AN  45000000   200000000
4      CA  15000000    50000000
0      CN  27000400  1700000000
3      CN  65000000   100000000
1      KZ  15000000   700000000
6      KZ  37000400    20000000

CodePudding user response:

The first part is good, after that, you can do sort the values using df.sort_values(by=["country", "Totalvalue"]) which will sort by country first, then within the same country to sort by Totalvalue

df = df.groupby(["country" ,"goodCode"]).sum().reset_index()

df = df.sort_values(by=["country", "Totalvalue"], ascending=[True, False])
  • Related