I have extracted a table from a database and wish to do some topic analysis on some entries. I have created an empty matrix with unique topic names and I have duplicate rows because there are potentially multiple topics associated with each 'name' entry. Ultimately, I would like a dataframe that has 1's across the row where a topic was associated with it. I will then remove the 'topic label' column, and at some point remove duplicate rows. The actual dataframe is much larger, but here I am just showing an illustration.
Here is my data:
   topic_label              name                                              Misconceptions  Long-term health issues  Reproductive disease  Inadequate research  Unconscious bias
0  Misconceptions           When is menstrual bleeding too much?              0               0                        0                     0                    0
1  Long-term health issues  When is menstrual bleeding too much?              0               0                        0                     0                    0
2  Reproductive disease     10% of reproductive age women have endometriosis  0               0                        0                     0                    0
3  Inadequate research      10% of reproductive age women have endometriosis  0               0                        0                     0                    0
4  Unconscious bias         Male bias threatens women's health                0               0                        0                     0                    0
And I would like it to look like this:
   topic_label              name                                              Misconceptions  Long-term health issues  Reproductive disease  Inadequate research  Unconscious bias
0  Misconceptions           When is menstrual bleeding too much?              1               1                        0                     0                    0
1  Long-term health issues  When is menstrual bleeding too much?              1               1                        0                     0                    0
2  Reproductive disease     10% of reproductive age women have endometriosis  0               0                        1                     1                    0
3  Inadequate research      10% of reproductive age women have endometriosis  0               0                        1                     1                    0
4  Unconscious bias         Male bias threatens women's health                0               0                        0                     0                    1
I have tried to use .loc in a loop to first slice the data by name and then assign the values (after setting name as the index), but this doesn't work when a name appears in only one row:
name_set = list(set(df['name']))
df = df.set_index('name')
for i in name_set:
    df.loc[i, list(df.loc[i]['topic_label'])] = 1
I feel like I am going round in circles here... is there a better way to do this?
Answer:
One option is to use get_dummies to get the dummy variables for each topic_label, then call sum in groupby.transform to aggregate the dummy variables for each name:
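import pandas as pd
# One-hot encode the labels, then sum the indicators within each name group
cols = df['topic_label'].unique()  # unique, so repeated labels do not create duplicate columns
dummies = pd.get_dummies(df['topic_label']).groupby(df['name']).transform('sum')
out = df.drop(columns=cols).join(dummies.reindex(cols, axis=1))
print(out)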
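To make the chain easier to follow, here is a self-contained sketch that rebuilds the sample data from the question and runs the same idea one step at a time. The intermediate names (topics, dummies, per_name) are only illustrative, and dtype=int is passed to get_dummies on the assumption that integer 0/1 indicators are wanted regardless of pandas version.

```python
import pandas as pd

# Rebuild the sample data from the question (all topic columns start at 0).
df = pd.DataFrame({
    'topic_label': ['Misconceptions', 'Long-term health issues',
                    'Reproductive disease', 'Inadequate research',
                    'Unconscious bias'],
    'name': ['When is menstrual bleeding too much?',
             'When is menstrual bleeding too much?',
             '10% of reproductive age women have endometriosis',
             '10% of reproductive age women have endometriosis',
             "Male bias threatens women's health"],
})
topics = df['topic_label'].unique()
for topic in topics:
    df[topic] = 0

# Step 1: one indicator column per topic, marking each row's own label.
dummies = pd.get_dummies(df['topic_label'], dtype=int)

# Step 2: sum the indicators within each name and broadcast the totals back
# to every row of that name (transform keeps the original row index).
per_name = dummies.groupby(df['name']).transform('sum')

# Step 3: put the columns back in the original topic order.
per_name = per_name.reindex(topics, axis=1)

# Step 4: swap the all-zero topic columns for the filled-in ones.
out = df.drop(columns=topics).join(per_name)
print(out)
```

Because the sums are taken per name, a value above 1 would appear if the same topic were listed twice for one name; per_name.clip(upper=1) is one way to keep strict 0/1 flags in that case.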
The above returns a new DataFrame out. If you want to update df instead, then you can use update:
# update modifies df in place, matching on index and column labels
df.update(pd.get_dummies(df['topic_label']).groupby(df['name']).transform('sum').reindex(df['topic_label'].unique(), axis=1))
print(df)
Output:
   topic_label              name                                              Misconceptions  Long-term health issues  Reproductive disease  Inadequate research  Unconscious bias
0  Misconceptions           When is menstrual bleeding too much?              1               1                        0                     0                    0
1  Long-term health issues  When is menstrual bleeding too much?              1               1                        0                     0                    0
2  Reproductive disease     10% of reproductive age women have endometriosis  0               0                        1                     1                    0
3  Inadequate research      10% of reproductive age women have endometriosis  0               0                        1                     1                    0
4  Unconscious bias         Male bias threatens women's health                0               0                        0                     0                    1
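As for why the loop in the question fails for a name that occurs only once: with a single matching row, df.loc[i] returns that row as a Series, so df.loc[i]['topic_label'] is a plain string and list(...) splits it into individual characters. A minimal sketch of one way around this, assuming you want to keep the loop, is to select with a list (df.loc[[i], ...]) so the result is always a Series of labels. The three-row frame below is a cut-down version of the sample data, used only for illustration.

```python
import pandas as pd

# Cut-down sample: one name with two topics, one name with a single topic.
df = pd.DataFrame({
    'topic_label': ['Misconceptions', 'Long-term health issues', 'Unconscious bias'],
    'name': ['When is menstrual bleeding too much?',
             'When is menstrual bleeding too much?',
             "Male bias threatens women's health"],
})
for topic in df['topic_label'].unique():
    df[topic] = 0
df = df.set_index('name')

for name in df.index.unique():
    # Selecting with [name] (a list) always yields a Series, even when only
    # one row matches, so tolist() gives whole labels rather than characters.
    labels = df.loc[[name], 'topic_label'].tolist()
    df.loc[name, labels] = 1

print(df)
```

On a large frame the get_dummies approach is likely to be faster, since it avoids a Python-level loop over names.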