Home > Software design >  find parent id and child id based on condition
find parent id and child id based on condition

Time:09-27

I've a sample dataframe

id    mobile_number    to_be_deleted
 1       123                0
 2       123                1
 3       123                1
 4       456                0
 5       789                0
 6       789                1
 7       1234               0
 8       789                1

Considering the rows that contain to_be_deleted values as 0, those record's ids will be parent_id, and the same mobile_number in anthoer records would be child's id.

I'm expecting to summarize the above dataframe to

parent_id       child_id
    1              2, 3
    4              null
    5              6, 8
    7              null

What could be the best way to summarize the dataframe?

CodePudding user response:

You can use pd.crosstab with join as aggfunc. Note that we also need to use astype(str) so that we can concatenate numbers:

pd.crosstab(index=df['mobile_number'], columns = df['to_be_deleted'].astype(str),\
       values=df['id'].astype(str), aggfunc= lambda x: ', '.join(x))

Output:

       to_be_deleted    0   1
mobile_number       
123                    1    2, 3
1234                   7    NaN
456                    4    NaN
789                    5    6, 8

CodePudding user response:

You can do two aggs for to_be_deleted == 0, and for == 1 and then outer merge them.

childs = df[df.to_be_deleted == 1].groupby('mobile_number').agg(child_id=('id', list))
parents = df[df.to_be_deleted == 0].groupby('mobile_number').agg(parent_id=('id', 'first'))
out = parents.merge(childs, on='mobile_number', how='outer')

print(out):

   mobile_number  parent_id child_id
0            123          1   [2, 3]
1            456          4      NaN
2            789          5   [6, 8]
3           1234          7      NaN

CodePudding user response:

One way to do it is as follows.

We start by applying the following custom function

df_new = df[df['to_be_deleted'] == 0].groupby('mobile_number')['id'].apply(lambda x: ','.join(x.astype(str))).reset_index()

[Out]:
   mobile_number id
0            123  1
1            456  4
2            789  5
3           1234  7

And to obtain the correct parent_id column there's still one thing to do: rename column id to parent_id. For that one can use pandas.DataFrame.rename as follows

df_new.rename(columns={'id': 'parent_id'}, inplace=True)

[Out]:
   mobile_number parent_id
0            123         1
1            456         4
2            789         5
3           1234         7

Now, with the parent_id ready, all one has to do is to tweak the mobile_number column to obtain the child_id. For that one can use the following custom function

df_new['child_id'] = df_new['mobile_number'].apply(lambda x: ','.join(df[(df['mobile_number'] == x) & (df['to_be_deleted'] == 1)]['id'].astype(str)))

[Out]:
   mobile_number parent_id child_id
0            123         1      2,3
1            456         4         
2            789         5      6,8
3           1234         7         

Finally, as one doesn't want the column mobile_number, one can remove it with pandas.DataFrame.pop as

df_new.pop('mobile_number')

[Out]:
  parent_id child_id
0         1      2,3
1         4         
2         5      6,8
3         7         

There's room to wrap everything in a function as follows

def summarization(df):

    df_new = df[df['to_be_deleted'] == 0].groupby('mobile_number')['id'].apply(lambda x: ','.join(x.astype(str))).reset_index()

    df_new.rename(columns={'id': 'parent_id'}, inplace=True)

    df_new['child_id'] = df_new['mobile_number'].apply(lambda x: ','.join(df[(df['mobile_number'] == x) & (df['to_be_deleted'] == 1)]['id'].astype(str)))

    df_new.pop('mobile_number')

    return df_new

And then all one has to do is to apply the function to the dataframe

df_new = summarization(df)

[Out]:
  parent_id child_id
0         1      2,3
1         4         
2         5      6,8
3         7         
  • Related