Is it possible to combine agg and value_counts in single line with Pandas-CodePudding

Given a df

   a  b  ngroup
0  1  3       0
1  1  4       0
2  1  1       0
3  3  7       2
4  4  4       2
5  1  1       4
6  2  2       4
7  1  1       4
8  6  6       5

I would like to compute the summation of multiple columns (i.e., a and b) grouped by the column ngroup.

In addition, I would like to count the number of element for each of the group.

Based on these two condition, the expected output as below

a   b   nrow_same_group ngroup
3   8   3               0
7   11  2               2
4   4   3               4
6   6   1               5

The following code should do the work

import pandas as pd

df=pd.DataFrame(list(zip([1,1,1,3,4,1,2,1,6,10],
                         [3,4,1,7,4,1,2,1,6,1],
                         [0,0,0,2,2,4,4,4,5])),columns=['a','b','ngroup'])


grouped_df = df.groupby(['ngroup'])
df1 = grouped_df[['a','b']].agg('sum').reset_index()


df2 = df['ngroup'].value_counts().reset_index()

df2.sort_values('index', axis=0, ascending=True, inplace=True, kind='quicksort', na_position='last')
df2.reset_index(drop=True, inplace=True)
df2.rename(columns={'index':'ngroup','ngroup':'nrow_same_group'},inplace=True)
df= pd.merge(df1, df2, on=['ngroup'])

However, I wonder whether there exist built-in pandas that achieve something similar, in single line.

CodePudding user response：

First aggregate a, b with sum then calculate size of each group and assign this to nrow_same_group column

g = df.groupby('ngroup')
g.sum().assign(nrow_same_group=g.size())

        a   b  nrow_same_group
ngroup                        
0       3   8                3
2       7  11                2
4       4   4                3
5       6   6                1

CodePudding user response：

You can do it using only groupby agg.

import pandas as pd 

df=pd.DataFrame(list(zip([1,1,1,3,4,1,2,1,6,10],
                         [3,4,1,7,4,1,2,1,6,1],
                         [0,0,0,2,2,4,4,4,5])),columns=['a','b','ngroup'])

res = (
    df.groupby('ngroup', as_index=False)
      .agg(a=('a','sum'), b=('b', 'sum'), 
           nrow_same_group=('a', 'size'))
)

Here the parameters passed to agg are tuples whose first element is the column to aggregate and the second element is the aggregation function to apply to that column. The parameter names are the labels for the resulting columns.

Output:

>>> res

   ngroup  a   b  nrow_same_group
0       0  3   8                3
1       2  7  11                2
2       4  4   4                3
3       5  6   6                1