Suppose I have the following list of labels,
labs = ['G1','G2','G3','G4','G5','G6','G7']
and also suppose that I have the following df:
group entity_label
0 0 G1
1 0 G2
3 1 G5
4 1 G1
5 2 G1
6 2 G2
7 2 G3
To produce the above df you can use:
import pandas as pd
df_test = pd.DataFrame({'group': [0,0,0,1,1,2,2,2,2],
'entity_label':['G1','G2','G2','G5','G1','G1','G2','G3','G3']})
# assign the result back; drop_duplicates does not modify df_test in place
df_test = df_test.drop_duplicates(subset=['group','entity_label'], keep='first')
For each group, I want to look up its labels in labs and build a new dataframe with binary labels:
group entity_label_binary
0 0 [1, 1, 0, 0, 0, 0, 0]
1 1 [1, 0, 0, 0, 1, 0, 0]
2 2 [1, 1, 1, 0, 0, 0, 0]
That is, group 0 contains G1 and G2, hence the 1s in the first two positions of its row, and so on. How can I do this?
CodePudding user response:
One option, based on crosstab:
labs = ['G1','G2','G3','G4','G5','G6','G7']
(pd.crosstab(df_test['group'], df_test['entity_label'])
.clip(upper=1)
.reindex(columns=labs, fill_value=0)
.agg(list, axis=1)
.reset_index(name='entity_label_binary')
)
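To see what each step of the chain contributes, it can be unpacked into named intermediates. This is only a sketch: it rebuilds df_test from the question so that it runs on its own, and the names counts, flags and out are illustrative, not part of the original answer.

```python
import pandas as pd

labs = ['G1', 'G2', 'G3', 'G4', 'G5', 'G6', 'G7']
df_test = pd.DataFrame({
    'group': [0, 0, 0, 1, 1, 2, 2, 2, 2],
    'entity_label': ['G1', 'G2', 'G2', 'G5', 'G1', 'G1', 'G2', 'G3', 'G3'],
})

# Step 1: count how often each label occurs in each group
counts = pd.crosstab(df_test['group'], df_test['entity_label'])

# Step 2: cap the counts at 1 so duplicated rows become a presence flag
flags = counts.clip(upper=1)

# Step 3: add labels that never occur as all-zero columns, in the order of labs
flags = flags.reindex(columns=labs, fill_value=0)

# Step 4: collapse each row into a list and restore 'group' as a column
out = flags.agg(list, axis=1).reset_index(name='entity_label_binary')
print(out)
```

Because of the clip step, the result is the same whether or not the duplicate rows were dropped from df_test beforehand.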
Variant, with get_dummies and groupby.max:
# dtype=int keeps 0/1 output; recent pandas versions return booleans by default
(pd.get_dummies(df_test['entity_label'], dtype=int)
.groupby(df_test['group']).max()
.reindex(columns=labs, fill_value=0)
.agg(list, axis=1)
.reset_index(name='entity_label_binary')
)
Output:
group entity_label_binary
0 0 [1, 1, 0, 0, 0, 0, 0]
1 1 [1, 0, 0, 0, 1, 0, 0]
2 2 [1, 1, 1, 0, 0, 0, 0]
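If the literal "look up each label in labs" reading of the question is preferred, a plain-Python membership test per group is another possibility. Note that this is a different technique from the crosstab and get_dummies options: it is a minimal sketch that rebuilds df_test from the question, and the names present, binary and out are made up for illustration.

```python
import pandas as pd

labs = ['G1', 'G2', 'G3', 'G4', 'G5', 'G6', 'G7']
df_test = pd.DataFrame({
    'group': [0, 0, 0, 1, 1, 2, 2, 2, 2],
    'entity_label': ['G1', 'G2', 'G2', 'G5', 'G1', 'G1', 'G2', 'G3', 'G3'],
})

# Collect the distinct labels seen in each group (a set ignores duplicates)
present = df_test.groupby('group')['entity_label'].agg(set)

# Test every label of labs against that set, keeping the order of labs
binary = present.apply(lambda seen: [int(lab in seen) for lab in labs])

out = binary.reset_index(name='entity_label_binary')
print(out)
```

This avoids building a wide intermediate table, at the cost of a Python-level loop over labs for every group, so it is likely slower than the vectorized options when there are many groups.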