In pandas, I want to group by two columns and then do a count. After that I want to drop one of the columns I grouped on and calculate the average of the counts. I know how to do this easily in SQL, but I run into issues in pandas: I can't drop one of the columns I grouped on previously. Does anyone know how to do this nicely?
(I just care about the end result; the procedure doesn't have to be this way if there is a better one.)
Eg:
Name, City
Anna, New York
Carl, New York
Carl, New York
Steven, London
Carl, London
Anna, Paris
Carl, Paris
Carl, Paris
Group by 'Name' and 'City', then count:
Name, City, Count
Anna, New York, 1
Carl, New York, 2
Steven, London, 1
Carl, London, 1
Anna, Paris, 1
Carl, Paris, 2
Drop 'City' and take the average count for each 'Name':
Name, Count
Anna, 1
Carl, 1.66667
Steven, 1
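For reference, the sample data above can be built as a DataFrame like this (a minimal sketch; the intermediate `size()` call reproduces the per-(Name, City) counts in the middle table):

```python
import pandas as pd

# Sample data from the example above
df = pd.DataFrame({
    'Name': ['Anna', 'Carl', 'Carl', 'Steven', 'Carl', 'Anna', 'Carl', 'Carl'],
    'City': ['New York', 'New York', 'New York', 'London', 'London',
             'Paris', 'Paris', 'Paris'],
})

# Per-(Name, City) counts, as in the middle table
print(df.groupby(['Name', 'City']).size())
```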
CodePudding user response:
Use GroupBy.size for a MultiIndex Series, then group by the first level (Name) and aggregate the mean; last, create a DataFrame with Series.reset_index:
df1 = (df.groupby(['Name', 'City']).size()
         .groupby(level=0).mean()
         .reset_index(name='Count'))
print(df1)
Name Count
0 Anna 1.000000
1 Carl 1.666667
2 Steven 1.000000
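Breaking the chain into its three steps shows what each part produces (a sketch using the sample data from the question):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Anna', 'Carl', 'Carl', 'Steven', 'Carl', 'Anna', 'Carl', 'Carl'],
    'City': ['New York', 'New York', 'New York', 'London', 'London',
             'Paris', 'Paris', 'Paris'],
})

# Step 1: size() yields a Series indexed by a (Name, City) MultiIndex
counts = df.groupby(['Name', 'City']).size()

# Step 2: group by the first index level ('Name') and take the mean,
# which averages each person's per-city counts
means = counts.groupby(level=0).mean()

# Step 3: turn the Series back into a two-column DataFrame
df1 = means.reset_index(name='Count')
print(df1)
```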
CodePudding user response:
Your operation is equivalent to computing a crosstab and taking the mean per row of the non-zero values:
out = (pd.crosstab(df['Name'], df['City'])
         .replace(0, float('nan'))
         .mean(axis=1)
         .reset_index(name='Count')  # optional
       )
output:
Name Count
0 Anna 1.000000
1 Carl 1.666667
2 Steven 1.000000
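To see why this works, it may help to look at the intermediate crosstab (a sketch, assuming the sample data from the question): missing Name/City pairs appear as zeros, and replacing them with NaN excludes them from the row-wise mean.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Anna', 'Carl', 'Carl', 'Steven', 'Carl', 'Anna', 'Carl', 'Carl'],
    'City': ['New York', 'New York', 'New York', 'London', 'London',
             'Paris', 'Paris', 'Paris'],
})

# Rows are names, columns are cities, values are occurrence counts;
# pairs that never occur show up as 0
ct = pd.crosstab(df['Name'], df['City'])
print(ct)
```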