I've a sample dataframe
id mobile_number to_be_deleted
1 123 0
2 123 1
3 123 1
4 456 0
5 789 0
6 789 1
7 1234 0
8 789 1
Considering the rows that contain to_be_deleted
values as 0
, those record's ids will be parent_id, and the same mobile_number in anthoer records would be child's id.
I'm expecting to summarize the above dataframe to
parent_id child_id
1 2, 3
4 null
5 6, 8
7 null
What could be the best way to summarize the dataframe?
CodePudding user response:
You can use pd.crosstab
with join
as aggfunc
. Note that we also need to use astype(str)
so that we can concatenate numbers:
pd.crosstab(index=df['mobile_number'], columns = df['to_be_deleted'].astype(str),\
values=df['id'].astype(str), aggfunc= lambda x: ', '.join(x))
Output:
to_be_deleted 0 1
mobile_number
123 1 2, 3
1234 7 NaN
456 4 NaN
789 5 6, 8
CodePudding user response:
You can do two aggs for to_be_deleted == 0, and for == 1 and then outer merge them.
childs = df[df.to_be_deleted == 1].groupby('mobile_number').agg(child_id=('id', list))
parents = df[df.to_be_deleted == 0].groupby('mobile_number').agg(parent_id=('id', 'first'))
out = parents.merge(childs, on='mobile_number', how='outer')
print(out):
mobile_number parent_id child_id
0 123 1 [2, 3]
1 456 4 NaN
2 789 5 [6, 8]
3 1234 7 NaN
CodePudding user response:
One way to do it is as follows.
We start by applying the following custom function
df_new = df[df['to_be_deleted'] == 0].groupby('mobile_number')['id'].apply(lambda x: ','.join(x.astype(str))).reset_index()
[Out]:
mobile_number id
0 123 1
1 456 4
2 789 5
3 1234 7
And to obtain the correct parent_id
column there's still one thing to do: rename column id
to parent_id
. For that one can use pandas.DataFrame.rename
as follows
df_new.rename(columns={'id': 'parent_id'}, inplace=True)
[Out]:
mobile_number parent_id
0 123 1
1 456 4
2 789 5
3 1234 7
Now, with the parent_id
ready, all one has to do is to tweak the mobile_number
column to obtain the child_id
. For that one can use the following custom function
df_new['child_id'] = df_new['mobile_number'].apply(lambda x: ','.join(df[(df['mobile_number'] == x) & (df['to_be_deleted'] == 1)]['id'].astype(str)))
[Out]:
mobile_number parent_id child_id
0 123 1 2,3
1 456 4
2 789 5 6,8
3 1234 7
Finally, as one doesn't want the column mobile_number
, one can remove it with pandas.DataFrame.pop
as
df_new.pop('mobile_number')
[Out]:
parent_id child_id
0 1 2,3
1 4
2 5 6,8
3 7
There's room to wrap everything in a function as follows
def summarization(df):
df_new = df[df['to_be_deleted'] == 0].groupby('mobile_number')['id'].apply(lambda x: ','.join(x.astype(str))).reset_index()
df_new.rename(columns={'id': 'parent_id'}, inplace=True)
df_new['child_id'] = df_new['mobile_number'].apply(lambda x: ','.join(df[(df['mobile_number'] == x) & (df['to_be_deleted'] == 1)]['id'].astype(str)))
df_new.pop('mobile_number')
return df_new
And then all one has to do is to apply the function to the dataframe
df_new = summarization(df)
[Out]:
parent_id child_id
0 1 2,3
1 4
2 5 6,8
3 7