I have a df like this:
>>> df1
col_1 col_2 labels
0 aaa abc (71020,)
1 ddd ghi (99213, 99287,)
2 bbb cde (77085,)
3 eee ijk (99233, 71020, 36415,)
and another df like this:
>>> df2
71020 77085 36415 99213 99287 99233 labels_mg
0 1 0 1 0 0 1 (99233, 71020, 36415,)
1 1 0 0 0 0 0 (71020,)
2 0 0 0 1 1 0 (99213, 99287)
3 0 1 0 0 0 0 (77085,)
and would like to generate a df by right-joining above 2 dfs, like this below:
col_1 col_2 labels 71020 77085 36415 99213 99287 99233
0 aaa abc (71020,) 1 0 0 0 0 0
1 ddd ghi (99213, 99287,) 0 0 0 1 1 0
2 bbb cde (77085,) 0 1 0 0 0 0
3 eee ijk (99233, 71020, 36415,) 1 0 1 0 0 1
Here's what I have tried, but this generates an empty dataframe with 0 rows, but has all column names.
pd.merge(left=df1, right=df2, left_on=['labels'], right_on=['labels_mg'])
tuples are parsed as tuples in both dfs. I have done ast.literal_eval
on columns on both of those df after reading from files to pandas dfs. both dfs doesn't share common index too.
my df sizes are (528840, 207) and (528840, 5). how do i do this efficiently?
CodePudding user response:
There was problem one tuples was integers, second was filled by strings, solution is converted them to integers like:
import ast
df1['labels'] = df1['labels'].apply(ast.literal_eval)
df2['labels_mg'] = df2['labels_mg'].apply(lambda x: tuple([int(y) for y in ast.literal_eval(x)]))
df = pd.merge(left=df1, right=df2, left_on=['labels'], right_on=['labels_mg'])
print (df)
col_1 col_2 labels 71020 77085 36415 99213 99287 \
0 aaa abc (71020,) 1 0 0 0 0
1 ddd ghi (99213, 99287) 0 0 0 1 1
2 bbb cde (77085,) 0 1 0 0 0
3 eee ijk (99233, 71020, 36415) 1 0 1 0 0
99233 labels_mg
0 0 (71020,)
1 0 (99213, 99287)
2 0 (77085,)
3 1 (99233, 71020, 36415)