Home > OS >  join two pandas dataframes by two columns without order
join two pandas dataframes by two columns without order

Time:01-03

I have two dataframes which look like this:

net1

Gene_A Gene_B abs_dif log2FC omic
0 RBL2 ATOH7 0.277574 0.634319 trans
1 RBL2 USF2 0.276369 0.567421 trans
2 RBL2 RASA3 0.166101 0.474418 trans

net2

Gene_A Gene_B abs_dif log2FC omic
0 ATOH7 RBL2 0.277574 0.634319 meth
1 RBL2 USF2 0.276369 0.567421 meth
2 RBL2 Fernando 0.166101 0.474418 meth

Each row represents a relationship. I am trying to merge both pandas dataframes based on columns 'Gene_A' and 'Gene_B', keeping both common and not common relationships. I am doing it using:

merged = pd.merge(net1, net2, how='outer', on=["Gene_A", "Gene_B"], indicator=True)

Which returns:

Gene_A Gene_B abs_dif_x log2FC_x omic_x abs_dif_y log2FC_y omic_y _merge
0 RBL2 ATOH7 0.277574 0.634319 trans nan nan nan left_only
1 RBL2 USF2 0.276369 0.567421 trans 0.276369 0.567421 meth both
2 ATOH7 RBL2 nan nan nan 0.277574 0.634319 meth right_only
3 RBL2 Fernando nan nan nan 0.166101 0.474418 meth right_only

However, relationship 0 in both tables is basically the same, so the order of Gene_A and Gene_B should not be considered in the merging. Hence, the first row in the merged dataframe should be:

Gene_A Gene_B abs_dif_x log2FC_x omic_x abs_dif_y log2FC_y omic_y _merge
0 RBL2 ATOH7 0.277574 0.634319 trans 0.277574 0.634319 meth both

How can I merge the dataframes without considering the order of the elements in the two columns on which dataframes are merged?

CodePudding user response:

You can apply a frozenset on the columns you want to use as unordered key. Then use those as key to merge:

key1 = net1[['Gene_A', 'Gene_B']].apply(frozenset, axis=1)
key2 = net2[['Gene_A', 'Gene_B']].apply(frozenset, axis=1)
net1.merge(net2, left_on=key1, right_on=key2)

Output:

           key_0 Gene_A_x Gene_B_x  abs_dif_x  log2FC_x omic_x Gene_A_y Gene_B_y  abs_dif_y  log2FC_y omic_y
0  (ATOH7, RBL2)     RBL2    ATOH7   0.277574  0.634319  trans    ATOH7     RBL2   0.277574  0.634319   meth
1   (USF2, RBL2)     RBL2     USF2   0.276369  0.567421  trans     RBL2     USF2   0.276369  0.567421   meth

NB. As you have identical column names, you might need to perform some cleanup.

CodePudding user response:

There are many ways of doing this one of method i found which might be efficient but works :

for dataframe 1

net1['combined']=net1.apply(lambda x: str(set([x['Gene_A'], x['Gene_B']])),axis=1)

output:

  Gene_A    Gene_B   abs_dif    log2FC  omic              combined
0  ATOH7      RBL2  0.277574  0.634319  meth     {'ATOH7', 'RBL2'}
1   RBL2      USF2  0.276369  0.567421  meth      {'RBL2', 'USF2'}
2   RBL2  Fernando  0.166101  0.474418  meth  {'Fernando', 'RBL2'}

for dataframe2

net2['combined']=net2.apply(lambda x: str(set([x['Gene_A'], x['Gene_B']])),axis=1)

output:

  Gene_A Gene_B   abs_dif    log2FC   omic           combined
0   RBL2  ATOH7  0.277574  0.634319  trans  {'ATOH7', 'RBL2'}
1   RBL2   USF2  0.276369  0.567421  trans   {'RBL2', 'USF2'}
2   RBL2  RASA3  0.166101  0.474418  trans  {'RBL2', 'RASA3'}

for merging:

merged = pd.merge(net1, net2, how='outer', on=["combined"], indicator=True)

output:

  Gene_A_x  Gene_B_x  abs_dif_x  log2FC_x omic_x              combined Gene_A_y Gene_B_y  abs_dif_y  log2FC_y omic_y      _merge
0    ATOH7      RBL2   0.277574  0.634319   meth     {'ATOH7', 'RBL2'}     RBL2    ATOH7   0.277574  0.634319  trans        both
1     RBL2      USF2   0.276369  0.567421   meth      {'RBL2', 'USF2'}     RBL2     USF2   0.276369  0.567421  trans        both
2     RBL2  Fernando   0.166101  0.474418   meth  {'Fernando', 'RBL2'}      NaN      NaN        NaN       NaN    NaN   left_only
3      NaN       NaN        NaN       NaN    NaN     {'RBL2', 'RASA3'}     RBL2    RASA3   0.166101  0.474418  trans  right_only

Merged dataframe is as expected the unneccesary columns can be dropped using drop function.

  • Related