Home > Blockchain >  pandas - show N highest counts in a group-by dataframe
pandas - show N highest counts in a group-by dataframe

Time:12-24

Here is my input DataFrame

df = pd.DataFrame([
 ['CA', 'LAX', 'Alice'],
 ['NY', 'NYC', 'Alice'],
 ['TX', 'HOU', 'Alice'],
 ['CA', 'SFO', 'Bob'], 
 ['NY', 'NYC', 'Bob'],
 ['TX', 'AUS', 'Bob'], 
 ['CA', 'LAX', 'Carol'],
 ['NY', 'ALB', 'Carol'], 
 ['TX', 'HOU', 'Carol'],
 ['CA', 'LAX', 'Dan'],
 ['NY', 'NYC', 'Dan'],
 ['TX', 'DAL', 'Dan'],
 ['CA', 'SFO', 'Eve'],
 ['NY', 'NYC', 'Eve'],
 ['TX', 'AUS', 'Eve'], 
 ['CA', 'SDO', 'Frank'],
 ['NY', 'NYC', 'Frank'],
 ['TX', 'HOU', 'Frank']], 
 columns=['State','City','Person']) 

And here is the output I wish to get:

State City  Count       
CA    LAX    3
      SFO    2
NY    ALB    1
      NYC    5
TX    HOU    3
      AUS    2

Basically, I want to show the top 2 cities having the highest count of people, for each state.

I have tried using this groupby statement: df.groupby(['State','City'])['Person'].count()

which generates below output

State  City         
CA     LAX    3
       SDO    1
       SFO    2
NY     ALB    1
       NYC    5
TX     AUS    2
       DAL    1
       HOU    3

But I am unable to come up with statements which could sort the above counts and pick the highest two counts. Would really appreciate the help!

CodePudding user response:

I think you can simplify the approach.

You can use groupby.value_counts(), instead of count(), to get where you are, and then chain another groupby().head(2) to get the top 2 Cities per each state.

The reason is that value_counts() resulting object will be in descending order, so that the first element is the most frequently-occurring element, so you can use head(2) as it is.

df.groupby('State')['City'].value_counts().groupby('State').head(2)

State  City
CA     LAX     3
       SFO     2
NY     NYC     5
       ALB     1
TX     HOU     3
       AUS     2
Name: City, dtype: int64

CodePudding user response:

Your first groupby was correct, after that you want to sort your values based on State and Count.

Then you group again solely on the state and fetch the head(2). If you want, you can (re)set your index to State and City.

df2 = df.groupby(['State','City'])['Person'].count().to_frame('Count')
df2 = df2.reset_index().sort_values(['State', 'Count'], ascending=[True, False])

df2.groupby('State').head(2).set_index(['State','City'])

Output:

                Count
State   City    
CA      LAX     3
        SFO     2
NY      NYC     5
        ALB     1
TX      HOU     3
        AUS     2
  • Related