I have two dataframe
Tab1
Species start end
G1:Sp1 5 30
G1:Sp1 60 90
G1:Sp2 50 120
Tab2
Species start end
G1:SP1 1 4
G1:SP1 30 59
G1:SP1 70 93
G1:SP2 48 110
G1:SP2 50 120
G1:SP2 56 124
G1:SP2 130 170
And I would like to remove from Tab2 each Species
where the start
and end
coordinates overlaps with the corresponding Species
coordinates in Tab1.
What I mean by overlap ?:
Let's say I have in Tab1:
Species start end
G1:Sp1 5 30
Then if a G1:SP1
in Tab2 have the following start
and end
:
start end
1 5 (does not overlap with 5-30)
5 27 (does overlap with 5-30, from inside)
5 30 (does overlap with 5-30, exactly the same coordinate)
10 27 (does overlap with 5-30, from inside)
10 33 (does overlap with 5-30, from inside and outside right)
2 25 (does overlap with 5-30, from inside and outside left)
33 50 (does not overlap with 5-30)
In the exemple then :
G1:SP1 1 4 <- does not overlap (I keep)
G1:SP1 30 59 <- does not overlap (I keep)
G1:SP1 70 93 <- does overlap with (**60-90**, I remove)
G1:SP2 48 110 <- does overlap with (**50-120**, I remove)
G1:SP2 50 120 <- does overlap with (**50-120**, I remove)
G1:SP2 56 124 <- does not overlap (I keep)
G1:SP2 130 170 <- does not overlap (I keep)
I should then get:
Species start end
G1:SP1 1 4
G1:SP1 30 59
G1:SP2 56 124
G1:SP2 130 170
Here are the data in dic format if it can help :
tab1.to_dict()
{'Species': {0: 'G1:Sp1', 1: 'G1:Sp1', 2: 'G1:Sp2'}, 'start': {0: 5, 1: 60, 2: 50}, 'end ': {0: 30, 1: 90, 2: 120}}
tab2.to_dict()
{'Species': {0: 'G1:SP1', 1: 'G1:SP1', 2: 'G1:SP1', 3: 'G1:SP2', 4: 'G1:SP2', 5: 'G1:SP2', 6: 'G1:SP2'}, 'start': {0: 1, 1: 30, 2: 70, 3: 48, 4: 50, 5: 56, 6: 130}, 'end': {0: 4, 1: 59, 2: 93, 3: 110, 4: 120, 5: 124, 6: 170}}
CodePudding user response:
Solution
tab1['key'] = tab1['Species'].str.upper()
tab2['key'] = tab2['Species'].str.upper()
tab = tab2.reset_index().merge(tab1, on='key',
how='left', suffixes=['', '_y'])
is_overlap = tab.eval('(start >= start_y and start <= end_y)'
'or (end >= start_y and end <= end_y)')
idx = tab.loc[~is_overlap, 'index']
tab2_new = tab2[tab2.index.isin(idx)]
Step by step details:
- Transform the column
Species
in both the dataframes to create a newkey
column which will act as a case insensitive merge key
>>> tab1
Species start end key
0 G1:Sp1 5 30 G1:SP1
1 G1:Sp1 60 90 G1:SP1
2 G1:Sp2 50 120 G1:SP2
>>> tab2
Species start end key
0 G1:SP1 1 4 G1:SP1
1 G1:SP1 30 59 G1:SP1
2 G1:SP1 70 93 G1:SP1
3 G1:SP2 48 110 G1:SP2
4 G1:SP2 50 120 G1:SP2
5 G1:SP2 56 124 G1:SP2
6 G1:SP2 130 170 G1:SP2
- Reset the index of
tab2
then left merge withtab1
>>> tab
index Species start end key Species_y start_y end_y
0 0 G1:SP1 1 4 G1:SP1 G1:Sp1 5 30
1 0 G1:SP1 1 4 G1:SP1 G1:Sp1 60 90
2 1 G1:SP1 30 59 G1:SP1 G1:Sp1 5 30
3 1 G1:SP1 30 59 G1:SP1 G1:Sp1 60 90
4 2 G1:SP1 70 93 G1:SP1 G1:Sp1 5 30
5 2 G1:SP1 70 93 G1:SP1 G1:Sp1 60 90
6 3 G1:SP2 48 110 G1:SP2 G1:Sp2 50 120
7 4 G1:SP2 50 120 G1:SP2 G1:Sp2 50 120
8 5 G1:SP2 56 124 G1:SP2 G1:Sp2 50 120
9 6 G1:SP2 130 170 G1:SP2 G1:Sp2 50 120
- Evaluate the condition which checks for the overlaps in two dataframes
>>> is_overlap
0 False
1 False
2 True
3 False
4 False
5 True
6 True
7 True
8 True
9 False
dtype: bool
- Select the indices which does not satisfy the
is_overlap
condition
>>> idx
0 0
1 0
3 1
4 2
9 6
Name: index, dtype: int64
- Filter the rows in
tab2
using the above indices
>>> tab2_new
Species start end key
0 G1:SP1 1 4 G1:SP1
1 G1:SP1 30 59 G1:SP1
2 G1:SP1 70 93 G1:SP1
6 G1:SP2 130 170 G1:SP2