join two pandas dataframes by two columns without order-CodePudding

I have two dataframes which look like this:

net1

	Gene_A	Gene_B	abs_dif	log2FC	omic
0	RBL2	ATOH7	0.277574	0.634319	trans
1	RBL2	USF2	0.276369	0.567421	trans
2	RBL2	RASA3	0.166101	0.474418	trans

net2

	Gene_A	Gene_B	abs_dif	log2FC	omic
0	ATOH7	RBL2	0.277574	0.634319	meth
1	RBL2	USF2	0.276369	0.567421	meth
2	RBL2	Fernando	0.166101	0.474418	meth

Each row represents a relationship. I am trying to merge both pandas dataframes based on columns 'Gene_A' and 'Gene_B', keeping both common and not common relationships. I am doing it using:

merged = pd.merge(net1, net2, how='outer', on=["Gene_A", "Gene_B"], indicator=True)

Which returns:

	Gene_A	Gene_B	abs_dif_x	log2FC_x	omic_x	abs_dif_y	log2FC_y	omic_y	_merge
0	RBL2	ATOH7	0.277574	0.634319	trans	nan	nan	nan	left_only
1	RBL2	USF2	0.276369	0.567421	trans	0.276369	0.567421	meth	both
2	ATOH7	RBL2	nan	nan	nan	0.277574	0.634319	meth	right_only
3	RBL2	Fernando	nan	nan	nan	0.166101	0.474418	meth	right_only

However, relationship 0 in both tables is basically the same, so the order of Gene_A and Gene_B should not be considered in the merging. Hence, the first row in the merged dataframe should be:

	Gene_A	Gene_B	abs_dif_x	log2FC_x	omic_x	abs_dif_y	log2FC_y	omic_y	_merge
0	RBL2	ATOH7	0.277574	0.634319	trans	0.277574	0.634319	meth	both

How can I merge the dataframes without considering the order of the elements in the two columns on which dataframes are merged?

CodePudding user response：

You can apply a frozenset on the columns you want to use as unordered key. Then use those as key to merge:

key1 = net1[['Gene_A', 'Gene_B']].apply(frozenset, axis=1)
key2 = net2[['Gene_A', 'Gene_B']].apply(frozenset, axis=1)
net1.merge(net2, left_on=key1, right_on=key2)

Output:

           key_0 Gene_A_x Gene_B_x  abs_dif_x  log2FC_x omic_x Gene_A_y Gene_B_y  abs_dif_y  log2FC_y omic_y
0  (ATOH7, RBL2)     RBL2    ATOH7   0.277574  0.634319  trans    ATOH7     RBL2   0.277574  0.634319   meth
1   (USF2, RBL2)     RBL2     USF2   0.276369  0.567421  trans     RBL2     USF2   0.276369  0.567421   meth

NB. As you have identical column names, you might need to perform some cleanup.

CodePudding user response：

There are many ways of doing this one of method i found which might be efficient but works :

for dataframe 1

net1['combined']=net1.apply(lambda x: str(set([x['Gene_A'], x['Gene_B']])),axis=1)

output:

  Gene_A    Gene_B   abs_dif    log2FC  omic              combined
0  ATOH7      RBL2  0.277574  0.634319  meth     {'ATOH7', 'RBL2'}
1   RBL2      USF2  0.276369  0.567421  meth      {'RBL2', 'USF2'}
2   RBL2  Fernando  0.166101  0.474418  meth  {'Fernando', 'RBL2'}

for dataframe2

net2['combined']=net2.apply(lambda x: str(set([x['Gene_A'], x['Gene_B']])),axis=1)

output:

  Gene_A Gene_B   abs_dif    log2FC   omic           combined
0   RBL2  ATOH7  0.277574  0.634319  trans  {'ATOH7', 'RBL2'}
1   RBL2   USF2  0.276369  0.567421  trans   {'RBL2', 'USF2'}
2   RBL2  RASA3  0.166101  0.474418  trans  {'RBL2', 'RASA3'}

for merging:

merged = pd.merge(net1, net2, how='outer', on=["combined"], indicator=True)

output:

  Gene_A_x  Gene_B_x  abs_dif_x  log2FC_x omic_x              combined Gene_A_y Gene_B_y  abs_dif_y  log2FC_y omic_y      _merge
0    ATOH7      RBL2   0.277574  0.634319   meth     {'ATOH7', 'RBL2'}     RBL2    ATOH7   0.277574  0.634319  trans        both
1     RBL2      USF2   0.276369  0.567421   meth      {'RBL2', 'USF2'}     RBL2     USF2   0.276369  0.567421  trans        both
2     RBL2  Fernando   0.166101  0.474418   meth  {'Fernando', 'RBL2'}      NaN      NaN        NaN       NaN    NaN   left_only
3      NaN       NaN        NaN       NaN    NaN     {'RBL2', 'RASA3'}     RBL2    RASA3   0.166101  0.474418  trans  right_only

Merged dataframe is as expected the unneccesary columns can be dropped using drop function.