Fill matrix in pandas by grouping rows

I have extracted a table from a database and want to do some topic analysis on its entries. I have added an all-zero column for each unique topic name. Rows are duplicated because each 'name' entry can have multiple topics associated with it. Ultimately, I would like a dataframe with 1s in every topic column associated with a row's name. I will then remove the 'topic_label' column and, at some point, remove duplicate rows. The actual dataframe is much larger; this is just an illustration.

Here is my data:

               topic_label                                              name  Misconceptions  Long-term health issues  Reproductive disease  Inadequate research  Unconscious bias
0           Misconceptions              When is menstrual bleeding too much?               0                        0                     0                    0                 0
1  Long-term health issues              When is menstrual bleeding too much?               0                        0                     0                    0                 0
2     Reproductive disease  10% of reproductive age women have endometriosis               0                        0                     0                    0                 0
3      Inadequate research  10% of reproductive age women have endometriosis               0                        0                     0                    0                 0
4         Unconscious bias                Male bias threatens women's health               0                        0                     0                    0                 0

And I would like it to look like this:

               topic_label                                              name  Misconceptions  Long-term health issues  Reproductive disease  Inadequate research  Unconscious bias
0           Misconceptions              When is menstrual bleeding too much?               1                        1                     0                    0                 0
1  Long-term health issues              When is menstrual bleeding too much?               1                        1                     0                    0                 0
2     Reproductive disease  10% of reproductive age women have endometriosis               0                        0                     1                    1                 0
3      Inadequate research  10% of reproductive age women have endometriosis               0                        0                     1                    1                 0
4         Unconscious bias                Male bias threatens women's health               0                        0                     0                    0                 1

I have tried to use .loc in a loop to first slice the data by name, and then assign the values (after setting name as index), but this doesn't work when a row is unique:

name_set = list(set(df['name']))
df = df.set_index('name')

for i in name_set:
    df.loc[i, list(df.loc[i]['topic_label'])] = 1

I feel like I am going round in circles here... is there a better way to do this?

CodePudding user response:

One option is to use get_dummies to build the dummy variables for each topic_label, then call sum via groupby.transform to aggregate the dummy variables per name:

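import pandas as pd
cols = df['topic_label'].unique()  # topic names double as the indicator column names
# One-hot encode topic_label, then sum the dummies within each name group
dummies = pd.get_dummies(df['topic_label'], dtype=int).groupby(df['name']).transform('sum')
out = df.drop(columns=cols).join(dummies.reindex(cols, axis=1))
print(out)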
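
To try the snippet above end to end, here is a minimal sketch that rebuilds the sample frame from the question and splits the chain into named steps. The intermediate names (dummies, per_name, result) are only illustrative, and the last line is one possible way to do the follow-up cleanup the question mentions, not something the question spelled out:

```python
import pandas as pd

topics = ["Misconceptions", "Long-term health issues", "Reproductive disease",
          "Inadequate research", "Unconscious bias"]
names = ["When is menstrual bleeding too much?",
         "When is menstrual bleeding too much?",
         "10% of reproductive age women have endometriosis",
         "10% of reproductive age women have endometriosis",
         "Male bias threatens women's health"]

# Rebuild the sample frame from the question: one row per (name, topic) pair,
# plus an all-zero indicator column for every topic.
df = pd.DataFrame({"topic_label": topics, "name": names})
for topic in topics:
    df[topic] = 0

cols = df["topic_label"].unique()

# Step 1: one-hot encode the topic of each row (exactly one 1 per row).
dummies = pd.get_dummies(df["topic_label"], dtype=int)

# Step 2: sum the dummies within each name and broadcast the result back to
# every row of that name, so rows sharing a name get identical indicators.
per_name = dummies.groupby(df["name"]).transform("sum")

# Step 3: restore the original column order and swap in the filled columns.
out = df.drop(columns=cols).join(per_name.reindex(cols, axis=1))

# Follow-up from the question: drop topic_label and collapse duplicate rows.
result = out.drop(columns="topic_label").drop_duplicates().reset_index(drop=True)
print(result)
```

Note that 'sum' produces counts, so if the same (name, topic) pair can appear more than once in the real data, some cells would exceed 1; passing 'max' to transform instead should keep the values at 0/1.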

The above returns a new DataFrame out. If you want to update df instead, then you can use update:

df.update(pd.get_dummies(df['topic_label']).groupby(df['name']).transform('sum').reindex(df['topic_label'], axis=1))
print(df)

Output:

               topic_label                                              name  Misconceptions  Long-term health issues  Reproductive disease  Inadequate research  Unconscious bias
0           Misconceptions              When is menstrual bleeding too much?               1                        1                     0                    0                 0
1  Long-term health issues              When is menstrual bleeding too much?               1                        1                     0                    0                 0
2     Reproductive disease  10% of reproductive age women have endometriosis               0                        0                     1                    1                 0
3      Inadequate research  10% of reproductive age women have endometriosis               0                        0                     1                    1                 0
4         Unconscious bias                Male bias threatens women's health               0                        0                     0                    0                 1
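
As for why the loop in the question breaks on a unique name: after set_index('name'), df.loc[i] returns a DataFrame when the label matches several rows but a Series when it matches only one. In that case df.loc[i]['topic_label'] is a plain string, and list(...) splits it into characters rather than topic names. If you prefer to keep a loop, a sketch that sidesteps this is to iterate over groupby groups, where topic_label is always a Series. The shortened three-row frame and the variable names below are illustrative, not taken from the question:

```python
import pandas as pd

df = pd.DataFrame({
    "topic_label": ["Misconceptions", "Long-term health issues", "Unconscious bias"],
    "name": ["When is menstrual bleeding too much?",
             "When is menstrual bleeding too much?",
             "Male bias threatens women's health"],
})
for topic in df["topic_label"].unique():
    df[topic] = 0

# The failure mode: a label that matches one row yields a scalar string,
# and list() of a string is a list of characters, not a list of topics.
indexed = df.set_index("name")
single = indexed.loc["Male bias threatens women's health"]["topic_label"]
print(type(single).__name__, list(single)[:3])

# Each group's topic_label is always a Series, even for a single row,
# so the same assignment works for unique and repeated names alike.
for name, group in df.groupby("name"):
    df.loc[group.index, group["topic_label"].unique()] = 1

print(df)
```

On a large frame the vectorized get_dummies approach is likely to be faster than any Python-level loop, so this is mainly useful for understanding what went wrong.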