Home > OS >  pandas - merge on a column with tuples
pandas - merge on a column with tuples

Time:10-21

I have a df like this:

>>> df1

        col_1   col_2    labels
0        aaa     abc     (71020,)
1        ddd     ghi     (99213, 99287,)
2        bbb     cde     (77085,)
3        eee     ijk     (99233, 71020, 36415,)

and another df like this:

>>> df2

   71020  77085  36415  99213  99287  99233  labels_mg
0    1      0      1      0      0      1     (99233, 71020, 36415,)
1    1      0      0      0      0      0     (71020,)
2    0      0      0      1      1      0     (99213, 99287)
3    0      1      0      0      0      0     (77085,)

and would like to generate a df by right-joining above 2 dfs, like this below:

        col_1   col_2    labels                     71020  77085  36415  99213  99287  99233
0        aaa     abc     (71020,)                    1      0      0      0      0      0
1        ddd     ghi     (99213, 99287,)             0      0      0      1      1      0
2        bbb     cde     (77085,)                    0      1      0      0      0      0
3        eee     ijk     (99233, 71020, 36415,)      1      0      1      0      0      1 

Here's what I have tried, but this generates an empty dataframe with 0 rows, but has all column names.

pd.merge(left=df1, right=df2, left_on=['labels'], right_on=['labels_mg'])

tuples are parsed as tuples in both dfs. I have done ast.literal_eval on columns on both of those df after reading from files to pandas dfs. both dfs doesn't share common index too.

my df sizes are (528840, 207) and (528840, 5). how do i do this efficiently?

CodePudding user response:

There was problem one tuples was integers, second was filled by strings, solution is converted them to integers like:

import ast

df1['labels'] = df1['labels'].apply(ast.literal_eval)
df2['labels_mg'] = df2['labels_mg'].apply(lambda x: tuple([int(y) for y in ast.literal_eval(x)]))
    
df = pd.merge(left=df1, right=df2, left_on=['labels'], right_on=['labels_mg'])
print (df)
  col_1 col_2                 labels  71020  77085  36415  99213  99287  \
0   aaa   abc               (71020,)      1      0      0      0      0   
1   ddd   ghi         (99213, 99287)      0      0      0      1      1   
2   bbb   cde               (77085,)      0      1      0      0      0   
3   eee   ijk  (99233, 71020, 36415)      1      0      1      0      0   

   99233              labels_mg  
0      0               (71020,)  
1      0         (99213, 99287)  
2      0               (77085,)  
3      1  (99233, 71020, 36415)  
  • Related