I'm trying to create an "aggregated" dictionary/JSON-like object inside a cell of a DataFrame where the keys are the values according to another column (for a particular group). Will illustrate this with an example below.
To begin, I'm starting with an table I've already called .explode
on.
Starting Table: start_df
child_id | parent_id | parent_name | tag_id | tag_name |
---|---|---|---|---|
12345 | 100 | 'Zack' | 99 | 'Builder' |
12345 | 101 | 'Cody' | 67 | 'Plumber' |
12345 | 102 | 'Jill' | 43 | 'Doctor' |
12345 | 102 | 'Jill' | 47 | 'Harvard' |
67890 | 108 | 'Emily' | 31 | 'Artist' |
67890 | 102 | 'Jill' | 43 | 'Doctor' |
67890 | 102 | 'Jill' | 47 | 'Harvard' |
For context, parents contain tags (properties) that a child will inherit. My goal is to create
Goal Table: output_df
child_id | parent_id_list | parent_desc_dict | full_tag_id_list | tag_ancestry |
---|---|---|---|---|
12345 | [100, 101, 102] | {100: 'Zack', 101: 'Cody', 102: 'Jill' } | [99, 67, 43, 47] | {100: [99], 101: [67], 102: [43, 47] } |
67890 | [102, 108] | {102: 'Jill', 108: 'Emily' } | [43, 47, 31] | {102: [43, 47], 108: [31] } |
For the lists parent_id_list
and full_tag_id_list
, I learned from here that we can do something like
start_df.groupby([collection_id], as_index = False).agg({'parent_id': list, 'tag_id': list})
...but how do we generate the dictionaries parent_desc_dict
and tag_ancestry
? Simply replacing list
with dict
won't do the job since that takes the row number as the dictionary's Key instead of the parent_id
. I figured the way to do this is with lambda functions somehow but not sure how to integrate it with the .agg
.
CodePudding user response:
Let us use custom agg
func with groupby apply
:
def agg(g):
return pd.Series({
'parent_id_list': [*g['parent_id'].unique()],
'parent_desc_dict': dict(zip(g['parent_id'], g['parent_name'])),
'full_tag_id_list': [*g['tag_id']],
'tag_ancestry': g.groupby('parent_id')['tag_id'].agg(list).to_dict()
})
df.groupby('child_id').apply(agg)
parent_id_list parent_desc_dict full_tag_id_list tag_ancestry
child_id
12345 [100, 101, 102] {100: 'Zack', 101: 'Cody', 102: 'Jill'} [99, 67, 43, 47] {100: [99], 101: [67], 102: [43, 47]}
67890 [108, 102] {108: 'Emily', 102: 'Jill'} [31, 43, 47] {102: [43, 47], 108: [31]}