I have one pandas dataframe like this:
>>> df1
col_1 col_2 labels
1 aaa abc [71020]
2 bbb cde [77085]
3 ccc efg [36415]
4 ddd ghi [99213, 99287]
5 eee ijk [99233, 71020, 36415]
and another dataframe like this(I got this df using sklearn.preprocessing.MultiLabelBinarizer on labels
column of above df1
):
>>> df2
71020 77085 36415 99213 99287 99233
1 1 0 0 0 0 0
2 0 1 0 0 0 0
3 0 0 1 0 0 0
4 0 0 0 1 1 0
5 1 0 1 0 0 1
and I would now like to merge(or right join) df2
on df2
, with keys being values from labels
column and all the column names of df2
, like shown below:
col_1 col_2 labels 71020 77085 36415 99213 99287 99233
1 aaa abc [71020] 1 0 0 0 0 0
2 bbb def [77085] 0 1 0 0 0 0
3 ccc ghi [36415] 0 0 1 0 0 0
4 ddd jkl [99213, 99287] 0 0 0 1 1 0
5 eee mno [99233, 71020, 36415] 1 0 1 0 0 1
How do I do this?
CodePudding user response:
Your desired output can be achieved (may be easier) if you create df2
from Pandas, as follows:
We create df2a
from df1
labels
column as follows:
# Assuming your `labels` column contain real list rather than string looks like list.
# If not, we can convert the string to real list first
import ast
df1['labels'] = df1['labels'].apply(ast.literal_eval)
# Then, create the dummy table by Pandas
df2a = df['labels'].explode().astype(str).str.get_dummies().groupby(level=0).max()
# Optionally convert the column labels back to integer from string
df2a.columns = df2a.columns.astype(int)
Result:
print(df2a)
36415 71020 77085 99213 99233 99287
0 0 1 0 0 0 0
1 0 0 1 0 0 0
2 1 0 0 0 0 0
3 0 0 0 1 0 1
4 1 1 0 0 1 0
Then, we can join with df1
to get the desired output:
df1.join(df2a)
Result:
col_1 col_2 labels 36415 71020 77085 99213 99233 99287
0 aaa abc [71020] 0 1 0 0 0 0
1 bbb cde [77085] 0 0 1 0 0 0
2 ccc efg [36415] 1 0 0 0 0 0
3 ddd ghi [99213, 99287] 0 0 0 1 0 1
4 eee ijk [99233, 71020, 36415] 1 1 0 0 1 0