Consider the following table.
-------- ----------- -------------
| group | numerical | categorical |
-------- ----------- -------------
| group1 | 23 | cat1 |
| group1 | 31 | cat1 |
| group1 | 10 | cat2 |
| group2 | 23 | cat1 |
| group2 | 51 | cat2 |
| group2 | 03 | cat2 |
-------- ----------- -------------
I want to group by the column group
and get the following output table which gives the count
of each category in the categorical
column as individual columns.
-------- --------------- ----------------------- -----------------------
| group | numerical_sum | categorical_cat1_size | categorical_cat2_size |
-------- --------------- ----------------------- -----------------------
| group1 | 64 | 2 | 1 |
| group2 | 77 | 1 | 2 |
-------- --------------- ----------------------- -----------------------
For the numerical column we can aggregate and sum
, but how can we get the value count of each category in the categorical
column as a separate column?
Note that using pd.value_counts
doesn't help as it gives the following output.
pd.groupby(by='group').agg({
'numerical': 'sum',
'categorical': pd.value_counts
})
-------- --------------- -------------
| group | numerical_sum | categorical |
-------- --------------- -------------
| group1 | 64 | [2,1] |
| group2 | 67 | [1,2] |
-------- --------------- -------------
CodePudding user response:
Combine aggregation sum
with crosstab
by DataFrame.join
:
df = (df.groupby(by='group', as_index=False)
.agg({'numerical': 'sum'})
.join(pd.crosstab(df['group'], df['categorical']), on='group'))
print (df)
group numerical cat1 cat2
0 group1 64 2 1
1 group2 77 1 2
CodePudding user response:
You can use:
out = pd.concat([df.groupby('group')['numerical'].sum().rename('numerical_sum'),
df.value_counts(['group', 'categorical']).unstack()
.add_prefix('categorical_').add_suffix('_size')],
axis=1).reset_index()
print(out)
# Output
group numerical_sum categorical_cat1_size categorical_cat2_size
0 group1 64 2 1
1 group2 77 1 2