get a sparse df of a column with lists of varying length and ensure the proper merge-CodePudding

I have one pandas dataframe like this:

>>> df1

        col_1   col_2    labels
1        aaa     abc     [71020]
2        bbb     cde     [77085]
3        ccc     efg     [36415]
4        ddd     ghi     [99213, 99287]
5        eee     ijk     [99233, 71020, 36415]

and another dataframe like this(I got this df using sklearn.preprocessing.MultiLabelBinarizer on labels column of above df1):

>>> df2

   71020  77085  36415  99213  99287  99233
1    1      0      0      0      0      0
2    0      1      0      0      0      0
3    0      0      1      0      0      0
4    0      0      0      1      1      0
5    1      0      1      0      0      1

and I would now like to merge(or right join) df2 on df2, with keys being values from labels column and all the column names of df2, like shown below:


        col_1   col_2    labels                   71020  77085  36415  99213  99287  99233
1        aaa     abc     [71020]                     1      0      0      0      0      0
2        bbb     def     [77085]                     0      1      0      0      0      0
3        ccc     ghi     [36415]                     0      0      1      0      0      0
4        ddd     jkl     [99213, 99287]              0      0      0      1      1      0
5        eee     mno     [99233, 71020, 36415]       1      0      1      0      0      1

How do I do this?

CodePudding user response：

Your desired output can be achieved (may be easier) if you create df2 from Pandas, as follows:

We create df2a from df1 labels column as follows:

# Assuming your `labels` column contain real list rather than string looks like list.  
# If not, we can convert the string to real list first
import ast
df1['labels'] = df1['labels'].apply(ast.literal_eval)

# Then, create the dummy table by Pandas
df2a = df['labels'].explode().astype(str).str.get_dummies().groupby(level=0).max()

# Optionally convert the column labels back to integer from string
df2a.columns = df2a.columns.astype(int)

Result:

print(df2a)

   36415  71020  77085  99213  99233  99287
0      0      1      0      0      0      0
1      0      0      1      0      0      0
2      1      0      0      0      0      0
3      0      0      0      1      0      1
4      1      1      0      0      1      0

Then, we can join with df1 to get the desired output:

df1.join(df2a)

Result:

  col_1 col_2                 labels  36415  71020  77085  99213  99233  99287
0   aaa   abc                [71020]      0      1      0      0      0      0
1   bbb   cde                [77085]      0      0      1      0      0      0
2   ccc   efg                [36415]      1      0      0      0      0      0
3   ddd   ghi         [99213, 99287]      0      0      0      1      0      1
4   eee   ijk  [99233, 71020, 36415]      1      1      0      0      1      0