Remove row from one tab if corresponding coordinate from another tab overlap-CodePudding

I have two dataframe

Tab1

Species      start  end 
G1:Sp1       5      30
G1:Sp1       60     90 
G1:Sp2       50     120

Tab2

Species start end
G1:SP1   1     4
G1:SP1   30    59 
G1:SP1   70    93
G1:SP2   48    110 
G1:SP2   50    120
G1:SP2   56    124 
G1:SP2   130   170

And I would like to remove from Tab2 each Species where the start and end coordinates overlaps with the corresponding Species coordinates in Tab1.

What I mean by overlap ?:

Let's say I have in Tab1:

Species      start  end 
G1:Sp1       5      30

Then if a G1:SP1 in Tab2 have the following start and end:

start end 
1     5  (does not overlap with 5-30)
5     27 (does overlap with 5-30, from inside) 
5     30 (does overlap with 5-30, exactly the same coordinate)
10    27 (does overlap with 5-30, from inside) 
10    33 (does overlap with 5-30, from inside and outside right)
2     25 (does overlap with 5-30, from inside and outside left)
33    50 (does not overlap with 5-30)

In the exemple then :

G1:SP1   1     4   <- does not overlap (I keep)
G1:SP1   30    59  <- does not overlap (I keep) 
G1:SP1   70    93  <- does overlap with (**60-90**, I remove) 
G1:SP2   48    110 <- does overlap with (**50-120**, I remove) 
G1:SP2   50    120 <- does overlap with (**50-120**, I remove) 
G1:SP2   56    124 <- does not overlap (I keep)
G1:SP2   130   170 <- does not overlap (I keep)

I should then get:

Species start end
G1:SP1   1     4
G1:SP1   30    59 
G1:SP2   56    124 
G1:SP2   130   170

Here are the data in dic format if it can help :

tab1.to_dict()

{'Species': {0: 'G1:Sp1', 1: 'G1:Sp1', 2: 'G1:Sp2'}, 'start': {0: 5, 1: 60, 2: 50}, 'end ': {0: 30, 1: 90, 2: 120}}

tab2.to_dict()

{'Species': {0: 'G1:SP1', 1: 'G1:SP1', 2: 'G1:SP1', 3: 'G1:SP2', 4: 'G1:SP2', 5: 'G1:SP2', 6: 'G1:SP2'}, 'start': {0: 1, 1: 30, 2: 70, 3: 48, 4: 50, 5: 56, 6: 130}, 'end': {0: 4, 1: 59, 2: 93, 3: 110, 4: 120, 5: 124, 6: 170}}

CodePudding user response：

Solution

tab1['key'] = tab1['Species'].str.upper()
tab2['key'] = tab2['Species'].str.upper()

tab = tab2.reset_index().merge(tab1, on='key', 
                               how='left', suffixes=['', '_y'])

is_overlap = tab.eval('(start >= start_y and start <= end_y)'
                      'or (end >= start_y and end <= end_y)')

idx = tab.loc[~is_overlap, 'index']
tab2_new = tab2[tab2.index.isin(idx)]

Step by step details:

Transform the column Species in both the dataframes to create a new key column which will act as a case insensitive merge key

>>> tab1
  Species  start  end     key
0  G1:Sp1      5   30  G1:SP1
1  G1:Sp1     60   90  G1:SP1
2  G1:Sp2     50  120  G1:SP2

>>> tab2
  Species  start  end     key
0  G1:SP1      1    4  G1:SP1
1  G1:SP1     30   59  G1:SP1
2  G1:SP1     70   93  G1:SP1
3  G1:SP2     48  110  G1:SP2
4  G1:SP2     50  120  G1:SP2
5  G1:SP2     56  124  G1:SP2
6  G1:SP2    130  170  G1:SP2

Reset the index of tab2 then left merge with tab1

>>> tab

   index Species  start  end     key Species_y  start_y  end_y
0      0  G1:SP1      1    4  G1:SP1    G1:Sp1        5     30
1      0  G1:SP1      1    4  G1:SP1    G1:Sp1       60     90
2      1  G1:SP1     30   59  G1:SP1    G1:Sp1        5     30
3      1  G1:SP1     30   59  G1:SP1    G1:Sp1       60     90
4      2  G1:SP1     70   93  G1:SP1    G1:Sp1        5     30
5      2  G1:SP1     70   93  G1:SP1    G1:Sp1       60     90
6      3  G1:SP2     48  110  G1:SP2    G1:Sp2       50    120
7      4  G1:SP2     50  120  G1:SP2    G1:Sp2       50    120
8      5  G1:SP2     56  124  G1:SP2    G1:Sp2       50    120
9      6  G1:SP2    130  170  G1:SP2    G1:Sp2       50    120

Evaluate the condition which checks for the overlaps in two dataframes

>>> is_overlap
0    False
1    False
2     True
3    False
4    False
5     True
6     True
7     True
8     True
9    False
dtype: bool

Select the indices which does not satisfy the is_overlap condition

>>> idx

0    0
1    0
3    1
4    2
9    6
Name: index, dtype: int64

Filter the rows in tab2 using the above indices

>>> tab2_new

  Species  start  end     key
0  G1:SP1      1    4  G1:SP1
1  G1:SP1     30   59  G1:SP1
2  G1:SP1     70   93  G1:SP1
6  G1:SP2    130  170  G1:SP2