Issue with renaming columns after using .groupby()


I have a df like this:

parent  child
Susie   jose
Susie   bob
Susie   bob
Susie   frank
Allen   bob
Allen   frank

and I want it to look like this:

parent  child_jose  child_bob  child_frank
Susie   1           2          1
Allen   0           1          1

which is to say: for each parent, count how many times each child appears, and make one column per child holding that count.
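For reference, the sample df can be built like this (just a sketch, using the lowercase parent/child names from the tables above):

import pandas as pd

# Build the example input shown above
df = pd.DataFrame({
    "parent": ["Susie", "Susie", "Susie", "Susie", "Allen", "Allen"],
    "child":  ["jose", "bob", "bob", "frank", "bob", "frank"],
})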

I used this code to first group by parent name:

by_parent = df.groupby(["ParentName", "ChildName"])['ChildName'].count().to_frame()

The output looks right, except that the count column is also named 'ChildName'.

So after this code it looks like

ParentName  ChildName  ChildName
Susie       jose       1
            bob        2
            frank      1
Allen       bob        1
            frank      1
            jose       0

I have tried a few of the similar solutions on here but have had no luck getting the column names straight. After using to_frame() the output looks right, but when I call df.columns, only 'ChildName' is shown. Any help with getting this first step down, and then possibly with pivoting off of the ChildName column, would be super helpful. Thanks in advance, and please comment if you need additional clarification.
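For reference, a sketch of one way to give the count column its own name at this step (reusing the ParentName/ChildName columns from the code above):

by_parent = (
    df.groupby(["ParentName", "ChildName"])
      .size()                       # counts rows per group, same as .count() on a column with no NaNs
      .reset_index(name="count")    # gives the count column an explicit name
)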

CodePudding user response:

crosstab is nice here as well, though it is more of a convenience tool (might be slow depending on dataset size):

(pd.crosstab(df.parent, df.child)
   .add_prefix('child_')
   .reset_index()
   .rename_axis(columns=None)
 )
  parent  child_bob  child_frank  child_jose
0  Allen          1            1           0
1  Susie          2            1           1
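A roughly equivalent pivot_table version (added here as a sketch, not part of the original answer) that may be a touch faster on larger frames, since crosstab is built on top of pivot_table:

# One row per parent, one count column per child, missing pairs filled with 0
out = (
    df.pivot_table(index="parent", columns="child", aggfunc="size", fill_value=0)
      .add_prefix("child_")
      .reset_index()
      .rename_axis(columns=None)
)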

CodePudding user response:

Use unstack after your groupby + count:

>>> df.groupby(["parent", "child"])['child'].count().unstack() \
      .add_prefix('child_').rename_axis(index=None, columns=None) \
      .fillna(0).astype(int)

        child_bob  child_frank  child_jose
Allen           1            1           0
Susie           2            1           1
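For context (an added sketch, not part of the original answer): before unstack(), the groupby result is a Series keyed by a (parent, child) MultiIndex, and unstack() pivots the inner child level into columns. Passing fill_value=0 keeps the counts as integers without the separate fillna/astype step:

counts = df.groupby(["parent", "child"])["child"].count()   # MultiIndex (parent, child) -> count
wide = counts.unstack(fill_value=0).add_prefix("child_")     # child values become columns, gaps become 0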

CodePudding user response:

Well, just to show another approach, since it's clear there are better answers here already.

You can group the data by parent and child, then iterate over the original dataframe and create the columns you need. Using the get_group() and count() methods, you get the count for each particular group.

# Count of each (parent, child) pair, looked up per row via get_group()
grouped_data = df.groupby(['parent', 'child'])['child']
for _, row in df.iterrows():
    df.loc[(df['parent'] == row.parent) & (df['child'] == row.child),
           f'child_{row.child}'] = grouped_data.get_group((row.parent, row.child)).count()

At this point, your dataframe looks as follows:

  parent  child  child_jose  child_bob  child_frank
0  Susie   jose         1.0        NaN          NaN
1  Susie    bob         NaN        2.0          NaN
2  Susie    bob         NaN        2.0          NaN
3  Susie  frank         NaN        NaN          1.0
4  Allen    bob         NaN        1.0          NaN
5  Allen  frank         NaN        NaN          1.0

Finally, we drop duplicates, drop the child column, and group by parent to get the sum():

df = df.drop_duplicates().drop(columns='child')
df = df.groupby(['parent']).sum().reset_index()

Output:

  parent  child_jose  child_bob  child_frank
0  Allen         0.0        1.0          1.0
1  Susie         1.0        2.0          1.0
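Since the NaN placeholders force the child_* columns to float, an optional final cast (not in the original answer) restores integer counts:

# Cast the float count columns back to integers once the NaNs are gone
count_cols = [c for c in df.columns if c.startswith('child_')]
df[count_cols] = df[count_cols].astype(int)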