Home > database >  Concatenating lists across different rows within a dataframe according to another column
Concatenating lists across different rows within a dataframe according to another column

Time:07-19

Would like to create a nested list of lists within one table according to a matching key in another table. To illustrate this, I've anonymized the PII data and built a small example of what I'm trying to do.


I'm starting with a table that lists all parent_id's for a given child_id stored as a list within a cell in the parent_id_list column of a Pandas dataframe. For context, parents contain tags that a child will inherit.

Table 1 (child_parent_relations_df)

child_id parent_id_list
12345 [100, 101, 102]
67890 [102, 108]

In another table that has had .explode applied to it and been joined to another table that maps parent_id to the tags that they contain (as a list in parent_tag_ids).

Table 2 (parent_id_with_tags_df) enter image description here

My goal is to create a clean table that has the parent_id kept as a list (as in Table 1) but now to condense the tag_id information in Table 2 into a nested list.

Goal Table

child_id parent_id_list inherited_tag_list
12345 [100, 101, 102] [[99], [67], [43, 47, 48]]
67890 [102, 108] [[43, 47, 48], [31, 32, 33, 34]]

I've thought about how to do this "from scratch" using loops but it seems really inefficient to have to do this for each child and then for each parent, concatenate the list of tags. Is there an easy way to do this using pandas functions?

CodePudding user response:

You can explode child_parent_relations_df parent_id_list column to rows then merge parent_tag_ids column from parent_id_with_tags_df and groupby.agg

child_parent_relations_df = child_parent_relations_df.explode('parent_id_list', ignore_index=True)
out = child_parent_relations_df.merge(parent_id_with_tags_df,
                                      left_on=['child_id', 'parent_id_list'],
                                      right_on=['child_id', 'parent_id'],
                                      how='left')
out = out.groupby(['child_id'], as_index=False).agg({'parent_id_list': list, 'parent_tag_ids': list})
  • Related