I have two dataframes which look like this:
net1
Gene_A | Gene_B | abs_dif | log2FC | omic | |
---|---|---|---|---|---|
0 | RBL2 | ATOH7 | 0.277574 | 0.634319 | trans |
1 | RBL2 | USF2 | 0.276369 | 0.567421 | trans |
2 | RBL2 | RASA3 | 0.166101 | 0.474418 | trans |
net2
Gene_A | Gene_B | abs_dif | log2FC | omic | |
---|---|---|---|---|---|
0 | ATOH7 | RBL2 | 0.277574 | 0.634319 | meth |
1 | RBL2 | USF2 | 0.276369 | 0.567421 | meth |
2 | RBL2 | Fernando | 0.166101 | 0.474418 | meth |
Each row represents a relationship. I am trying to merge both pandas dataframes based on columns 'Gene_A' and 'Gene_B', keeping both common and not common relationships. I am doing it using:
merged = pd.merge(net1, net2, how='outer', on=["Gene_A", "Gene_B"], indicator=True)
Which returns:
Gene_A | Gene_B | abs_dif_x | log2FC_x | omic_x | abs_dif_y | log2FC_y | omic_y | _merge | |
---|---|---|---|---|---|---|---|---|---|
0 | RBL2 | ATOH7 | 0.277574 | 0.634319 | trans | nan | nan | nan | left_only |
1 | RBL2 | USF2 | 0.276369 | 0.567421 | trans | 0.276369 | 0.567421 | meth | both |
2 | ATOH7 | RBL2 | nan | nan | nan | 0.277574 | 0.634319 | meth | right_only |
3 | RBL2 | Fernando | nan | nan | nan | 0.166101 | 0.474418 | meth | right_only |
However, relationship 0
in both tables is basically the same, so the order of Gene_A and Gene_B should not be considered in the merging. Hence, the first row in the merged dataframe should be:
Gene_A | Gene_B | abs_dif_x | log2FC_x | omic_x | abs_dif_y | log2FC_y | omic_y | _merge | |
---|---|---|---|---|---|---|---|---|---|
0 | RBL2 | ATOH7 | 0.277574 | 0.634319 | trans | 0.277574 | 0.634319 | meth | both |
How can I merge the dataframes without considering the order of the elements in the two columns on which dataframes are merged?
CodePudding user response:
You can apply
a frozenset
on the columns you want to use as unordered key. Then use those as key to merge:
key1 = net1[['Gene_A', 'Gene_B']].apply(frozenset, axis=1)
key2 = net2[['Gene_A', 'Gene_B']].apply(frozenset, axis=1)
net1.merge(net2, left_on=key1, right_on=key2)
Output:
key_0 Gene_A_x Gene_B_x abs_dif_x log2FC_x omic_x Gene_A_y Gene_B_y abs_dif_y log2FC_y omic_y
0 (ATOH7, RBL2) RBL2 ATOH7 0.277574 0.634319 trans ATOH7 RBL2 0.277574 0.634319 meth
1 (USF2, RBL2) RBL2 USF2 0.276369 0.567421 trans RBL2 USF2 0.276369 0.567421 meth
NB. As you have identical column names, you might need to perform some cleanup.
CodePudding user response:
There are many ways of doing this one of method i found which might be efficient but works :
for dataframe 1
net1['combined']=net1.apply(lambda x: str(set([x['Gene_A'], x['Gene_B']])),axis=1)
output:
Gene_A Gene_B abs_dif log2FC omic combined
0 ATOH7 RBL2 0.277574 0.634319 meth {'ATOH7', 'RBL2'}
1 RBL2 USF2 0.276369 0.567421 meth {'RBL2', 'USF2'}
2 RBL2 Fernando 0.166101 0.474418 meth {'Fernando', 'RBL2'}
for dataframe2
net2['combined']=net2.apply(lambda x: str(set([x['Gene_A'], x['Gene_B']])),axis=1)
output:
Gene_A Gene_B abs_dif log2FC omic combined
0 RBL2 ATOH7 0.277574 0.634319 trans {'ATOH7', 'RBL2'}
1 RBL2 USF2 0.276369 0.567421 trans {'RBL2', 'USF2'}
2 RBL2 RASA3 0.166101 0.474418 trans {'RBL2', 'RASA3'}
for merging:
merged = pd.merge(net1, net2, how='outer', on=["combined"], indicator=True)
output:
Gene_A_x Gene_B_x abs_dif_x log2FC_x omic_x combined Gene_A_y Gene_B_y abs_dif_y log2FC_y omic_y _merge
0 ATOH7 RBL2 0.277574 0.634319 meth {'ATOH7', 'RBL2'} RBL2 ATOH7 0.277574 0.634319 trans both
1 RBL2 USF2 0.276369 0.567421 meth {'RBL2', 'USF2'} RBL2 USF2 0.276369 0.567421 trans both
2 RBL2 Fernando 0.166101 0.474418 meth {'Fernando', 'RBL2'} NaN NaN NaN NaN NaN left_only
3 NaN NaN NaN NaN NaN {'RBL2', 'RASA3'} RBL2 RASA3 0.166101 0.474418 trans right_only
Merged dataframe is as expected the unneccesary columns can be dropped using drop function.