I have a df like this:
parent | child |
---|---|
Susie | jose |
Susie | bob |
Susie | bob |
Susie | frank |
Allen | bob |
Allen | frank |
and I want it to look like this:
Parent | child_jose | child_bob | child_frank |
---|---|---|---|
Susie | 1 | 2 | 1 |
Allen | 0 | 1 | 1 |
which is to just take the count of each child the parent has and make a column for that child with the count of how many kids each parent has
I used this code to first group by parent name:
''' by_parent = df.groupby(["ParentName", "ChildName"])['ChildName'].count().to_frame() '''
Which looks right when I get the output, except that the count column is named 'ChildName'.
So after this code it looks like
ParentName | ChildName | ChildName |
---|---|---|
Susie | jose | 1 |
bob | 2 | |
frank | 1 | |
Allen | bob | 1 |
frank | 1 | |
jose | 0 |
I have tried a few of the similar solutions on here but am getting no luck with getting the column names straight on this. After using to_frame() it looks right but when I call df.columns, only 'ChildName' is shown. Any help with how to get this first step down and then possibly help with pivoting off of the childName column would be super helpful. Thanks in advance and please comment if you need additional clarification.
CodePudding user response:
crosstab
is nice here as well, though it is more of a convenience tool (might be slow depending on dataset size):
(pd.crosstab(df.parent, df.child)
.add_prefix('child_')
.reset_index()
.rename_axis(columns=None)
)
parent child_bob child_frank child_jose
0 Allen 1 1 0
1 Susie 2 1 1
CodePudding user response:
Use unstack
after your groupby_count
:
>>> df.groupby(["parent", "child"])['child'].count().unstack() \
.add_prefix('child_').rename_axis(index=None, columns=None) \
.fillna(0).astype(int)
child_bob child_frank child_jose
Allen 1 1 0
Susie 2 1 1
CodePudding user response:
Well, just for showing another approach, because it's clear there are other better answers here.
You can group the data by parent and child, and then, you can iterate on the original dataframe and create the columns you need. Using get_group()
count()
methods, you get the count of that particular group.
grouped_data = df.groupby(['parent', 'child'])['child']
for _, row in df.iterrows():
df.loc[(df['parent'] == row.parent) & (df['child'] == row.child),
f'child_{row.child}'] = grouped_data.get_group((row.parent, row.child)).count()
At this point, your dataframe looks as follow:
parent | child | child_jose | child_bob | child_frank | |
---|---|---|---|---|---|
0 | Susie | jose | 1.0 | NaN | NaN |
1 | Susie | bob | NaN | 2.0 | NaN |
2 | Susie | bob | NaN | 2.0 | NaN |
3 | Susie | frank | NaN | NaN | 1.0 |
4 | Allen | bob | NaN | 1.0 | NaN |
5 | Allen | frank | NaN | NaN | 1.0 |
Finally, we delete the child
column, we delete duplicates, and we group by parent
to get the sum()
df = df.drop_duplicates().drop(columns='child')
df = df.groupby(['parent']).sum().reset_index()
Output:
parent | child_jose | child_bob | child_frank | |
---|---|---|---|---|
0 | Allen | 0.0 | 1.0 | 1.0 |
1 | Susie | 1.0 | 2.0 | 1.0 |