How to use `groupby` to aggregate columns into dictionary so that new column contains that dictionar-CodePudding

I have a dataframe that contains a person, year, and a bunch of flag containing columns, like below:

# sample dataframe
data = [["John Doe", 2018, True, False, True], ["Jane Doe", 2019, True, False, False]]

df = pd.DataFrame(data, columns=["person", "year", "flag_1", "flag_2", "flag_3"])
df

     person     year   flag_1   flag_2  flag_3
0   John Doe    2018    True    False   True
1   Jane Doe    2019    True    False   False

I would like my final output to be a groupby where the person and year are retained, and a new column containing a dictionary of all the flag results is stored.

Something kind of like, but not quite like, this:

First: reshape

reshaped_df = pd.melt(
    df, id_vars=["person", "year"], value_vars=["flag_1", "flag_2", "flag_3"]
)

     person     year    variable value
0   John Doe    2018    flag_1   True
1   Jane Doe    2019    flag_1   True
2   John Doe    2018    flag_2   False
3   Jane Doe    2019    flag_2   False
4   John Doe    2018    flag_3   True

Second: Create dictionary

reshaped_df.set_index(["person", "year", "variable"]).T.to_dict("list")

{('John Doe', 2018, 'flag_1'): [True],
 ('Jane Doe', 2019, 'flag_1'): [True],
 ('John Doe', 2018, 'flag_2'): [False],
 ('Jane Doe', 2019, 'flag_2'): [False],
 ('John Doe', 2018, 'flag_3'): [True],
 ('Jane Doe', 2019, 'flag_3'): [False]}

except I want my output to look like this:

    person      year    flag_dict
0   John Doe    2018    {'flag_1': True, 'flag_2': False, 'flag_3': True}
1   Jane Doe    2019    {'flag_1': True, 'flag_2': False, 'flag_3': False}

Is this possible? If so how can it be done? Thank you!

CodePudding user response：

It is possible to reshape the dataframe and create a new column containing a dictionary of all the flag results as described. Here's a way to do it:

# Reshape the dataframe
reshaped_df = pd.melt(
    df, id_vars=["person", "year"], value_vars=["flag_1", "flag_2", "flag_3"]
)

# Create a dictionary for each group of person, year
reshaped_df = reshaped_df.groupby(["person", "year"]).apply(lambda x: x.set_index("variable")["value"].to_dict()).reset_index()

# Rename the last column
reshaped_df.rename(columns={0: "flag_dict"}, inplace=True)

print(reshaped_df)

You should now have a dataframe that looks like this:

     person  year                                        flag_dict
0  John Doe  2018  {'flag_1': True, 'flag_2': False, 'flag_3': True}
1  Jane Doe  2019  {'flag_1': True, 'flag_2': False, 'flag_3': False

}

Hope this helps!

CodePudding user response：

No need for a complex reshape, simply use to_dict


cols = list (df.filter(like='flag_'))

out = (
 df.drop(columns=cols)
   .assign(flag_dict=df[cols].to_dict('records'))
 )

Output:

     person  year                                           flag_dict
0  John Doe  2018   {'flag_1': True, 'flag_2': False, 'flag_3': True}
1  Jane Doe  2019  {'flag_1': True, 'flag_2': False, 'flag_3': False}