Efficient way of splitting or deleting the dataframe rows based on range filtering


I have two dataframes of unequal length. The rows of the first dataframe must be filtered based on the ranges given in the second dataframe. For more context on the input/output, please refer to this post: Efficient solution of dataframe range filtering based on another ranges

For example:

M = (x,y) = [(10,20), (10,20), (10,20), (10,20), (10,20), (10,20)]
E = (m,n) = [(5,7), (15,16), (15,18), (21,25), (5,25), (5,15)]

Case-1: 
    M = [(10,20)]
    E = [(5,7)]
out: M = [(10,20)] (no change, because the E range lies entirely outside M)

Case-2: 
    M = [(10,20)]
    E = [(15,16)]
out: M = [(10,14),(17,20)] (split (10,20) into 2 rows to remove the E range inside it)

Case-3: 
    M = [(10,20)]
    E = [(21,25)]
out: M = [(10,20)] (no change, because the E range lies entirely outside M)

Case-4: 
    M = [(10,20)]
    E = [(5,25)]
out: M = [] (delete, because M is fully contained within the E range)

Case-5: 
    M = [(10,20)]
    E = [(5,15)]
out: M = [(16,20)] (only (16,20) remains, because it is not covered by the E range)

Case-6: 
    M = [(10,20)]
    E = [(13,20)]
out: M = [(10,12)] (only (10,12) remains, because it is not covered by the E range)

I have formulated the following algorithm for the above-stated cases:

M = (x, y)
E = (m, n)
if (m <= x):
    if (y <= n):
        delete the row                # M is fully covered by E
    elif (x <= n):
        (start, end) = (n+1, y)       # keep only the part of M to the right of E
    else:
        continue                      # E lies entirely to the left of M
else:
    if (y >= n):
        (start, end) = (x, m-1)       # E lies inside M: split M into two rows
        (start, end) = (n+1, y)       # (if y == n, this second piece is empty, as in Case-6)
    elif (y >= m):
        (start, end) = (x, m-1)       # keep only the part of M to the left of E
    else:
        continue                      # E lies entirely to the right of M
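
The same branching can be condensed into a small helper that returns the surviving piece(s) of a single M row against a single E row. This is only a rough sketch in plain Python (ranges assumed inclusive); it also guards the boundary cases such as Case-6, where y == n would otherwise produce an empty second piece:

def split_range(x, y, m, n):
    """Return the part(s) of the inclusive range (x, y) left after removing (m, n)."""
    pieces = []
    if x < m:                              # a piece of M survives to the left of E
        pieces.append((x, min(y, m - 1)))
    if y > n:                              # a piece of M survives to the right of E
        pieces.append((max(x, n + 1), y))
    return pieces                          # [] means the row is deleted

split_range(10, 20, 15, 16)   # [(10, 14), (17, 20)]  (Case-2)
split_range(10, 20,  5, 25)   # []                    (Case-4)
split_range(10, 20, 13, 20)   # [(10, 12)]            (Case-6)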

But I wanted to implement it using a combination of NumPy and pandas:

import numpy as np
import pandas as pd

df1 = pd.read_csv('a.tsv', sep='\t')  # main dataframe which I want to filter
temp_bed = bedfile  # 2nd dataframe based on which I need to filter

# reshape for array broadcasting
m = temp_bed['first.start'].to_numpy()[:, None]
n = temp_bed['first.end'].to_numpy()[:, None]

# A chunk_size that is too small or too big will lower performance.
# Experiment to find a sweet spot.
chunk_size = 100_000
offset = 0
mask = []

while offset < len(df1):
    x = df1['first.start'].to_numpy()[offset:offset + chunk_size]  # main
    y = df1['first.end'].to_numpy()[offset:offset + chunk_size]
    mask.append(
        # necessary logical conditions,
        # but the problem is with splitting the rows or ranges
        ((m <= x) & (n >= y)).any(axis=0)
    )
    offset += chunk_size

mask = np.hstack(mask)
df1[mask]
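
For a single chunk, the same broadcast comparisons can at least classify every row; this is only a rough sketch reusing the x, y, m, n arrays from above, and the actual splitting would still have to happen per row:

overlap = (m <= y) & (n >= x)                      # E range touches the M range at all
covered = (m <= x) & (n >= y)                      # E range fully swallows the M range
delete_mask = covered.any(axis=0)                  # rows to drop entirely
split_mask = overlap.any(axis=0) & ~delete_mask    # rows that must be trimmed or split
keep_mask = ~overlap.any(axis=0)                   # rows untouched by any E range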

Could anyone give me an efficient solution for splitting, deleting, or ignoring the dataframe rows based on the above conditions?

CodePudding user response:

Your case can be solved using NumPy and pandas together.

To get results for all your cases, I extended M and E by one pair:

M = [(10,20), (10,20), (10,20), (10,20), (10,20), (10,20), (10,20)]
E = [( 5, 7), (15,16), (15,18), (21,25), ( 5,25), ( 5,15), (13,20)]

To get the result, both as a list of numbers and as a sequence of pairs (inclusive ranges of consecutive numbers), you can run:

import numpy as np
import pandas as pd

for m1, m2, e1, e2 in np.hstack([M, E]):
    s = pd.Index(np.arange(m1, m2 + 1)).difference(
        pd.Index(np.arange(e1, e2 + 1))).to_series()
    rng = s.groupby(s.subtract(s.shift()).gt(1).cumsum()).apply(
        lambda grp: (grp.iloc[0], grp.iloc[-1]))
    print(f'{m1:2}, {m2:2},  {e1:2}, {e2:2}\n{s.tolist()}\n{rng.values}\n')

For the above source data I got:

10, 20,   5,  7
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
[(10, 20)]

10, 20,  15, 16
[10, 11, 12, 13, 14, 17, 18, 19, 20]
[(10, 14) (17, 20)]

10, 20,  15, 18
[10, 11, 12, 13, 14, 19, 20]
[(10, 14) (19, 20)]

10, 20,  21, 25
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
[(10, 20)]

10, 20,   5, 25
[]
[]

10, 20,   5, 15
[16, 17, 18, 19, 20]
[(16, 20)]

10, 20,  13, 20
[10, 11, 12]
[(10, 12)]

Each three-line block in the output contains:

  • the original data (a pair from M and a pair from E),
  • the result as a list of numbers,
  • the result as a sequence of ranges.
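
If you need to apply this to whole dataframes, with every M row reduced by all E ranges at once, the same Index.difference idea can be wrapped into a helper. This is only a sketch, assuming the column names 'first.start' / 'first.end' from your snippet and ranges small enough to be materialized as integer indexes:

import numpy as np
import pandas as pd

def filter_ranges(df1, temp_bed):
    # one Index holding every integer covered by any E range
    e_points = pd.Index(np.concatenate([
        np.arange(m, n + 1)
        for m, n in zip(temp_bed['first.start'], temp_bed['first.end'])
    ])).unique()

    out_rows = []
    for x, y in zip(df1['first.start'], df1['first.end']):
        s = pd.Index(np.arange(x, y + 1)).difference(e_points).to_series()
        if s.empty:
            continue                                   # row fully covered -> deleted
        # group consecutive survivors back into inclusive (start, end) pairs
        groups = s.subtract(s.shift()).gt(1).cumsum()
        for _, grp in s.groupby(groups):
            out_rows.append((grp.iloc[0], grp.iloc[-1]))
    return pd.DataFrame(out_rows, columns=['first.start', 'first.end'])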