In pandas, I want to group by two columns and then do a count. After that I want to drop one of the columns I grouped on and calculate the average of the counts. I know how to do this easily in SQL, but I run into issues in pandas: I can't drop one of the columns I grouped on previously. Does anyone know how to do this nicely?
(I just care about the end result; the procedure doesn't have to be this way if there is a better one.)
Eg:
Name, City
Anna, New York
Carl, New York
Carl, New York
Steven, London
Carl, London
Anna, Paris
Carl, Paris
Carl, Paris
Group by 'Name' and 'City', then count:
Name, City, Count
Anna, New York, 1
Carl, New York, 2
Steven, London, 1
Carl, London, 1
Anna, Paris, 1
Carl, Paris, 2
Drop 'City' and take the average count for each 'Name':
Name, Count
Anna, 1
Carl, 1.66667
Steven, 1
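For reference, the sample data above can be built as a DataFrame like this (a minimal sketch; the intermediate `size()` call reproduces the per-(Name, City) counts in the middle table):

```python
import pandas as pd

# Sample data from the example above
df = pd.DataFrame({
    'Name': ['Anna', 'Carl', 'Carl', 'Steven', 'Carl', 'Anna', 'Carl', 'Carl'],
    'City': ['New York', 'New York', 'New York', 'London', 'London',
             'Paris', 'Paris', 'Paris'],
})

# Per-(Name, City) counts, as in the middle table
print(df.groupby(['Name', 'City']).size())
```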
CodePudding user response:
Use GroupBy.size for a MultiIndex Series, then group by the first level (Name) and aggregate the mean; last, create a DataFrame with Series.reset_index:
df1 = (df.groupby(['Name', 'City']).size()
         .groupby(level=0).mean()
         .reset_index(name='Count'))
print(df1)
Name Count
0 Anna 1.000000
1 Carl 1.666667
2 Steven 1.000000
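Breaking the chain into its three steps shows what each part produces (a sketch using the sample data from the question):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Anna', 'Carl', 'Carl', 'Steven', 'Carl', 'Anna', 'Carl', 'Carl'],
    'City': ['New York', 'New York', 'New York', 'London', 'London',
             'Paris', 'Paris', 'Paris'],
})

# Step 1: size() yields a Series indexed by a (Name, City) MultiIndex
counts = df.groupby(['Name', 'City']).size()

# Step 2: group by the first index level ('Name') and take the mean,
# which averages each person's per-city counts
means = counts.groupby(level=0).mean()

# Step 3: turn the Series back into a two-column DataFrame
df1 = means.reset_index(name='Count')
print(df1)
```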
CodePudding user response:
Your operation is equivalent to computing a crosstab and taking the mean per row of the non-zero values:
out = (pd.crosstab(df['Name'], df['City'])
         .replace(0, float('nan'))
         .mean(axis=1)
         .reset_index(name='Count')  # optional
       )
output:
Name Count
0 Anna 1.000000
1 Carl 1.666667
2 Steven 1.000000
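To see why this works, it may help to look at the intermediate crosstab (a sketch, assuming the sample data from the question): missing Name/City pairs appear as zeros, and replacing them with NaN excludes them from the row-wise mean.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Anna', 'Carl', 'Carl', 'Steven', 'Carl', 'Anna', 'Carl', 'Carl'],
    'City': ['New York', 'New York', 'New York', 'London', 'London',
             'Paris', 'Paris', 'Paris'],
})

# Rows are names, columns are cities, values are occurrence counts;
# pairs that never occur show up as 0
ct = pd.crosstab(df['Name'], df['City'])
print(ct)
```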