I have two dataframes of unequal length; the rows of the first will be filtered based on the ranges in the second. For better context on the I/O, please refer to this post: Efficient solution of dataframe range filtering based on another ranges
For example:
M = (x,y) = [(10,20), (10,20), (10,20), (10,20), (10,20), (10,20)]
E = (m,n) = [(5,7), (15,16), (15,18), (21,25), (5,25), (5,15)]
Case-1:
M = [(10,20)]
E = [(5,7)]
out: M = [(10,20)] (no change, because the E range lies entirely outside M)
Case-2:
M = [(10,20)]
E = [(15,16)]
out: M = [(10,14),(17,20)] (split (10,20) into 2 rows to remove E range inside it)
Case-3:
M = [(10,20)]
E = [(21,25)]
out: M = [(10,20)] (no change, because the E range lies entirely outside M)
Case-4:
M = [(10,20)]
E = [(5,25)]
out: M = [] (delete, because M is totally inclusive within the range of E)
Case-5:
M = [(10,20)]
E = [(5,15)]
out: M = [(16,20)] (because (16,20) is not covered by the E range)
Case-6:
M = [(10,20)]
E = [(13,20)]
out: M = [(10,12)] (because (10,12) is not covered by the E range)
I have formulated the following algorithm for the above-stated cases:
M = (x, y)
E = (m, n)
if m <= x:
    if y <= n:
        delete the row
    elif x <= n:
        (start, end) = (n+1, y)
    else:
        continue
else:
    if y >= n:
        (start, end) = (x, m-1)
        (start, end) = (n+1, y)   # the row splits into two
    elif y >= m:
        (start, end) = (x, m-1)
    else:
        continue
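As a sanity check, the branch logic above can be written as a plain Python function. The name `clip_range` is mine, and the final filter that drops degenerate pieces is my addition (without it, Case-6, where y == n, would emit an empty (n+1, y) piece):

```python
def clip_range(x, y, m, n):
    """Sub-ranges of inclusive range (x, y) that survive removing (m, n)."""
    if m <= x:
        if y <= n:
            pieces = []                        # M fully inside E: delete the row
        elif x <= n:
            pieces = [(n + 1, y)]              # E covers the left end of M
        else:
            pieces = [(x, y)]                  # E lies entirely below M: no change
    else:
        if y >= n:
            pieces = [(x, m - 1), (n + 1, y)]  # E inside M: split into two rows
        elif y >= m:
            pieces = [(x, m - 1)]              # E covers the right end of M
        else:
            pieces = [(x, y)]                  # E lies entirely above M: no change
    # drop degenerate pieces (handles edges such as y == n in Case-6)
    return [(a, b) for a, b in pieces if a <= b]
```

Running it over the six cases reproduces the expected outputs above.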
But I wanted to implement it using NumPy and pandas combination:
import numpy as np
import pandas as pd

df1 = pd.read_csv('a.tsv', sep='\t')  # main dataframe which I want to filter
temp_bed = bedfile  # 2nd dataframe based on which I need to filter
# for array broadcasting
m = temp_bed['first.start'].to_numpy()[:, None]
n = temp_bed['first.end'].to_numpy()[:, None]
# A chunk_size that is too small or too big will lower performance.
# Experiment to find a sweet spot.
chunk_size = 100_000
offset = 0
mask = []
while offset < len(df1):
    x = df1['first.start'].to_numpy()[offset:offset + chunk_size]  # main
    y = df1['first.end'].to_numpy()[offset:offset + chunk_size]
    mask.append(
        # necessary logical conditions #####
        # but the problem is with splitting the rows or ranges
        ((m <= x) & (n >= y)).any(axis=0)
    )
    offset += chunk_size
mask = np.hstack(mask)
df1[mask]
Could anyone give me an efficient solution for splitting, deleting, or ignoring the dataframe rows based on the above conditions?
CodePudding user response:
Your case can be solved using NumPy and pandas together.
To get results for all your cases, I extended M and E by one pair:
M = [(10,20), (10,20), (10,20), (10,20), (10,20), (10,20), (10,20)]
E = [( 5, 7), (15,16), (15,18), (21,25), ( 5,25), ( 5,15), (13,20)]
To get the result, both as a list of numbers and as a sequence of pairs (inclusive ranges of consecutive numbers), you can run:
import numpy as np
import pandas as pd

for m1, m2, e1, e2 in np.hstack([M, E]):
    s = pd.Index(np.arange(m1, m2 + 1)).difference(pd.Index(
        np.arange(e1, e2 + 1))).to_series()
    rng = s.groupby(s.subtract(s.shift()).gt(1).cumsum()).apply(
        lambda grp: (grp.iloc[0], grp.iloc[-1]))
    print(f'{m1:2}, {m2:2}, {e1:2}, {e2:2}\n{s.tolist()}\n{rng.values}\n')
For the above source data I got:
10, 20, 5, 7
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
[(10, 20)]
10, 20, 15, 16
[10, 11, 12, 13, 14, 17, 18, 19, 20]
[(10, 14) (17, 20)]
10, 20, 15, 18
[10, 11, 12, 13, 14, 19, 20]
[(10, 14) (19, 20)]
10, 20, 21, 25
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
[(10, 20)]
10, 20, 5, 25
[]
[]
10, 20, 5, 15
[16, 17, 18, 19, 20]
[(16, 20)]
10, 20, 13, 20
[10, 11, 12]
[(10, 12)]
Each 3-row sequence contains:
- the original data (a pair from M and a pair from E),
- the result as a list of numbers,
- the result as a sequence of ranges.
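If the ranges are wide, materializing every integer with np.arange can get expensive. An alternative sketch (my own addition, not part of the answer above) subtracts the E interval from the M interval directly with interval arithmetic; `subtract_interval` is a name I made up:

```python
def subtract_interval(m1, m2, e1, e2):
    """Inclusive range (m1, m2) minus inclusive range (e1, e2)."""
    out = []
    if e1 > m1:                            # part of M left of E survives
        out.append((m1, min(m2, e1 - 1)))
    if e2 < m2:                            # part of M right of E survives
        out.append((max(m1, e2 + 1), m2))
    return out
```

Applied to each (M, E) pair in turn, this yields the same surviving pieces as the printed output above, in constant time per pair instead of time proportional to the range width.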