I have two dataframes of unequal length; the rows of the first will be filtered based on the ranges in the second. For better context on the I/O, please refer to this post: Efficient solution of dataframe range filtering based on another ranges
For example:
M = (x,y) = [(10,20), (10,20), (10,20), (10,20), (10,20), (10,20)]
E = (m,n) = [(5,7), (15,16), (15,18), (21,25), (5,25), (5,15)]
Case-1:
M = [(10,20)]
E = [(5,7)]
out: M = [(10,20)] (no change, because the E range lies entirely outside M)
Case-2:
M = [(10,20)]
E = [(15,16)]
out: M = [(10,14),(17,20)] (split (10,20) into 2 rows to remove E range inside it)
Case-3:
M = [(10,20)]
E = [(21,25)]
out: M = [(10,20)] (no change, because the E range lies entirely outside M)
Case-4:
M = [(10,20)]
E = [(5,25)]
out: M = [] (delete, because M is totally inclusive within the range of E)
Case-5:
M = [(10,20)]
E = [(5,15)]
out: M = [(16,20)] (because (16,20) is not covered by the E range)
Case-6:
M = [(10,20)]
E = [(13,20)]
out: M = [(10,12)] (because (10,12) is not covered by the E range)
I have formulated the following algorithm for the above-stated cases:
M = (x, y)
E = (m, n)
if m <= x:
    if y <= n:
        delete the row
    elif x <= n:
        (start, end) = (n+1, y)
    else:
        continue
else:
    if y >= n:
        (start, end) = (x, m-1)
        (start, end) = (n+1, y)   # the row splits into two
    elif y >= m:
        (start, end) = (x, m-1)
    else:
        continue
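As a sanity check, the branch logic above can be written as a plain Python function. The name `clip_range` is mine, and the final filter that drops degenerate pieces is my addition (without it, Case-6, where y == n, would emit an empty (n+1, y) piece):

```python
def clip_range(x, y, m, n):
    """Sub-ranges of inclusive range (x, y) that survive removing (m, n)."""
    if m <= x:
        if y <= n:
            pieces = []                        # M fully inside E: delete the row
        elif x <= n:
            pieces = [(n + 1, y)]              # E covers the left end of M
        else:
            pieces = [(x, y)]                  # E lies entirely below M: no change
    else:
        if y >= n:
            pieces = [(x, m - 1), (n + 1, y)]  # E inside M: split into two rows
        elif y >= m:
            pieces = [(x, m - 1)]              # E covers the right end of M
        else:
            pieces = [(x, y)]                  # E lies entirely above M: no change
    # drop degenerate pieces (handles edges such as y == n in Case-6)
    return [(a, b) for a, b in pieces if a <= b]
```

Running it over the six cases reproduces the expected outputs above.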
But I wanted to implement it using NumPy and pandas combination:
import numpy as np
import pandas as pd

df1 = pd.read_csv('a.tsv', sep='\t')  # main dataframe which I want to filter
temp_bed = bedfile  # 2nd dataframe based on which I need to filter
# for array broadcasting
m = temp_bed['first.start'].to_numpy()[:, None]
n = temp_bed['first.end'].to_numpy()[:, None]
# A chunk_size that is too small or too big will lower performance.
# Experiment to find a sweet spot.
chunk_size = 100_000
offset = 0
mask = []
while offset < len(df1):
    x = df1['first.start'].to_numpy()[offset:offset + chunk_size]  # main
    y = df1['first.end'].to_numpy()[offset:offset + chunk_size]
    mask.append(
        # necessary logical conditions #####
        # but the problem is with splitting the rows or ranges
        ((m <= x) & (n >= y)).any(axis=0)
    )
    offset += chunk_size
mask = np.hstack(mask)
df1[mask]
Could anyone give me an efficient solution for splitting, deleting, or ignoring the dataframe rows based on the above conditions?
CodePudding user response:
Your case can be solved using NumPy and pandas together.
To get results for all your cases, I extended M and E by one pair:
M = [(10,20), (10,20), (10,20), (10,20), (10,20), (10,20), (10,20)]
E = [( 5, 7), (15,16), (15,18), (21,25), ( 5,25), ( 5,15), (13,20)]
To get the result, both as a list of numbers and as a sequence of pairs (inclusive ranges of consecutive numbers), you can run:
import numpy as np
import pandas as pd

for m1, m2, e1, e2 in np.hstack([M, E]):
    s = pd.Index(np.arange(m1, m2 + 1)).difference(pd.Index(
        np.arange(e1, e2 + 1))).to_series()
    rng = s.groupby(s.subtract(s.shift()).gt(1).cumsum()).apply(
        lambda grp: (grp.iloc[0], grp.iloc[-1]))
    print(f'{m1:2}, {m2:2}, {e1:2}, {e2:2}\n{s.tolist()}\n{rng.values}\n')
For the above source data I got:
10, 20, 5, 7
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
[(10, 20)]
10, 20, 15, 16
[10, 11, 12, 13, 14, 17, 18, 19, 20]
[(10, 14) (17, 20)]
10, 20, 15, 18
[10, 11, 12, 13, 14, 19, 20]
[(10, 14) (19, 20)]
10, 20, 21, 25
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
[(10, 20)]
10, 20, 5, 25
[]
[]
10, 20, 5, 15
[16, 17, 18, 19, 20]
[(16, 20)]
10, 20, 13, 20
[10, 11, 12]
[(10, 12)]
Each 3-row sequence contains:
- the original data (a pair from M and a pair from E),
- the result as a list of numbers,
- the result as a sequence of ranges.
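If the ranges are wide, materializing every integer with np.arange can get expensive. An alternative sketch (my own addition, not part of the answer above) subtracts the E interval from the M interval directly with interval arithmetic; `subtract_interval` is a name I made up:

```python
def subtract_interval(m1, m2, e1, e2):
    """Inclusive range (m1, m2) minus inclusive range (e1, e2)."""
    out = []
    if e1 > m1:                            # part of M left of E survives
        out.append((m1, min(m2, e1 - 1)))
    if e2 < m2:                            # part of M right of E survives
        out.append((max(m1, e2 + 1), m2))
    return out
```

Applied to each (M, E) pair in turn, this yields the same surviving pieces as the printed output above, in constant time per pair instead of time proportional to the range width.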