Issue with renaming columns after using .groupby()-CodePudding

I have a df like this:

parent	child
Susie	jose
Susie	bob
Susie	bob
Susie	frank
Allen	bob
Allen	frank

and I want it to look like this:

Parent	child_jose	child_bob	child_frank
Susie	1	2	1
Allen	0	1	1

which is to just take the count of each child the parent has and make a column for that child with the count of how many kids each parent has

I used this code to first group by parent name:

''' by_parent = df.groupby(["ParentName", "ChildName"])['ChildName'].count().to_frame() '''

Which looks right when I get the output, except that the count column is named 'ChildName'.

So after this code it looks like

ParentName	ChildName	ChildName
Susie	jose	1
	bob	2
	frank	1
Allen	bob	1
	frank	1
	jose	0

I have tried a few of the similar solutions on here but am getting no luck with getting the column names straight on this. After using to_frame() it looks right but when I call df.columns, only 'ChildName' is shown. Any help with how to get this first step down and then possibly help with pivoting off of the childName column would be super helpful. Thanks in advance and please comment if you need additional clarification.

CodePudding user response：

crosstab is nice here as well, though it is more of a convenience tool (might be slow depending on dataset size):

(pd.crosstab(df.parent, df.child)
   .add_prefix('child_')
   .reset_index()
   .rename_axis(columns=None)
 )
  parent  child_bob  child_frank  child_jose
0  Allen          1            1           0
1  Susie          2            1           1

CodePudding user response：

Use unstack after your groupby_count:

>>> df.groupby(["parent", "child"])['child'].count().unstack() \
      .add_prefix('child_').rename_axis(index=None, columns=None) \
      .fillna(0).astype(int)

        child_bob  child_frank  child_jose
Allen           1            1           0
Susie           2            1           1

CodePudding user response：

Well, just for showing another approach, because it's clear there are other better answers here.

You can group the data by parent and child, and then, you can iterate on the original dataframe and create the columns you need. Using get_group() count() methods, you get the count of that particular group.

grouped_data = df.groupby(['parent', 'child'])['child']
for _, row in df.iterrows():
  df.loc[(df['parent'] == row.parent) & (df['child'] == row.child),
         f'child_{row.child}'] = grouped_data.get_group((row.parent, row.child)).count()

At this point, your dataframe looks as follow:

	parent	child	child_jose	child_bob	child_frank
0	Susie	jose	1.0	NaN	NaN
1	Susie	bob	NaN	2.0	NaN
2	Susie	bob	NaN	2.0	NaN
3	Susie	frank	NaN	NaN	1.0
4	Allen	bob	NaN	1.0	NaN
5	Allen	frank	NaN	NaN	1.0

Finally, we delete the child column, we delete duplicates, and we group by parent to get the sum()

df = df.drop_duplicates().drop(columns='child')
df = df.groupby(['parent']).sum().reset_index()

Output:

	parent	child_jose	child_bob	child_frank
0	Allen	0.0	1.0	1.0
1	Susie	1.0	2.0	1.0