Efficient solution of dataframe range filtering based on another ranges-CodePudding

I tried the following code to find the range of a dataframe not within the range of another dataframe. However, it takes more than a day to compute the large files because, in the last 2 for-loops, it's comparing each row. Each of my 24 dataframes has around 10^8 rows. Is there any efficient alternative to the following approach?

Please refer to this thread for a better understanding of my I/O: Return the range of a dataframe not within a range of another dataframe

My approach:
I created the tuple pairs from the (df1['first.start'], df1['first.end']) and (df2['first.start'], df2['first.end']) initially in order to apply the range() function. After that, I put a condition whether df1_ranges are in the ranges of df2_ranges or not. Here the edge case was df1['first.start'] = df1['first.end']. I collected the filtered indices from the iterations and then passed into the df1.

df2_lst=[]
for i,j in zip(temp_df2['first.start'], temp_df2['first.end']):
    df2_lst.append(i)
    df2_lst.append(j)
df1_lst=[]
for i,j in zip(df1['first.start'], df1['first.end']):
    df1_lst.append(i)
    df1_lst.append(j)

def range_subset(range1, range2):
    """Whether range1 is a subset of range2."""
    if not range1:
        return True  # empty range is a subset of anything
    if not range2:
        return False  # non-empty range can't be a subset of empty range
    if len(range1) > 1 and range1.step % range2.step:
        return False  # must have a single value or integer multiple step
    return range1.start in range2 and range1[-1] in range2

##### FUNCTION FOR CREATING CHUNKS OF LISTS ####
def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i],lst[i 1]

df1_lst2 = list(chunks(df1_lst,2))
df2_lst2 = list(chunks(df2_lst,2))
indices=[]
for idx,i in enumerate(df1_lst2): #main list
    x,y = i
    for j in df2_lst2: #filter list
        m,n = j
        if((x!=y) & (range_subset(range(x,y), range(m,n)))): #checking if the main list exists in the filter range or not
            indices.append(idx) #collecting the filtered indices

df1.iloc[indices]

CodePudding user response：

If n and m are the number of rows in df1 and df2, any algorithm needs to make at least n * m comparision to check every range in df1 against every range in df2, The problem with your code as posted is that (a) it has too may intermediate steps and (b) it uses slow Python loops. If you switch to numpy broadcast, which uses highly optimized C loop under the hood, it will be a lot faster.

The downside with numpy broadcast is memory: it will create a comparison matrix of n * m bytes and the size of your problem may run your computer out of memory. We can mitigate that by chunking df1 to trade performance for lower memory usage.

# Sample data
def random_dataframe(size):
    a = np.random.randint(1, 100, 2*size).cumsum()
    return pd.DataFrame({
        'first.start': a[::2],
        'first.end': a[1::2]
    })

n, m = 10_000_000, 1000
np.random.seed(42)
df1 = random_dataframe(n)
df2 = random_dataframe(m)

# ---------------------------

# Prepare the Start and End time of df2 for comparison
# [:, None] raise the array by one dimension, which is necessary
# for array broadcasting
s2 = df2['first.start'].to_numpy()[:, None]
e2 = df2['first.end'].to_numpy()[:, None]

# A chunk_size that is too small or too big will lower performance.
# Experiment to find a sweet spot
chunk_size = 100_000
offset = 0
mask = []

while offset < len(df1):
    s1 = df1['first.start'].to_numpy()[offset:offset chunk_size]
    e1 = df1['first.end'].to_numpy()[offset:offset chunk_size]
    mask.append(
        ((s2 <= s1) & (s1 <= e2) & (s2 <= e1) & (e1 <= e2)).any(axis=0)
    )
    offset  = chunk_size

mask = np.hstack(mask)

The above code took 30s on my computer. Result:

df1[mask]  # ranges in df1 that are completely surrounded by a range in df2
df1[~mask] # ranges in df1 that are NOT completely surrounded by any range in df2

By tweaking the comparison, you can check for overlapping ranges too.