How to efficiently check whether an integer falls within any of multiple ranges in Python

The objective is to perform a particular procedure if the counter (i.e., idx) falls within any of several ranges.

In this case, the ranges come from a DataFrame, as below:

df=pd.DataFrame(dict(rbot=[4,20],rtop=[8,25]))

For example, a certain activity is triggered if the counter value is within 4-8 or 20-25 (both ends inclusive).

The following code achieves this objective:

import pandas as pd

df=pd.DataFrame(dict(rbot=[4,20],rtop=[8,25]))

r_bot=df['rbot'].values.tolist()
r_top=df['rtop'].values.tolist()
for idx in range(120):
    h=[True for x,y in zip(r_bot,r_top) if x <= idx <=y ]

    if True in h:
        print(f'Do some operation with  {idx}')

This produces the following output:

Do some operation with  4
Do some operation with  5
Do some operation with  6
Do some operation with  7
Do some operation with  8
Do some operation with  20
Do some operation with  21
Do some operation with  22
Do some operation with  23
Do some operation with  24
Do some operation with  25

In the actual implementation, there can be hundreds of range pairs, while the counter can run into the hundreds of thousands. Hence, I am wondering whether there is a more efficient way of doing this.

CodePudding user response：

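One option is pd.cut with an IntervalIndex:

import numpy as np
import pandas as pd

# df is the DataFrame from the question
arr = np.arange(120)
intervals = pd.IntervalIndex.from_arrays(df.rbot, df.rtop, closed='both')

# Values outside every interval are labelled NaN
out = pd.cut(arr, intervals)

# Keep only the values that fell inside some interval
out = arr[pd.notna(out)]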

for idx in out:
    print(f'Do some operation with  {idx}')


Do some operation with  4
Do some operation with  5
Do some operation with  6
Do some operation with  7
Do some operation with  8
Do some operation with  20
Do some operation with  21
Do some operation with  22
Do some operation with  23
Do some operation with  24
Do some operation with  25

Alternatively, you can skip building out and test each value directly; which is better depends on your end goal:


for idx in arr:
    if intervals.contains(idx).any():
        print(f'Do some operation with  {idx}')

Thanks to @user2246849 for the timing tests; I think you should have a look at them and see whether they meet your needs.
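
To illustrate why filtering with pd.notna works in the pd.cut version above: pd.cut labels each value with the interval that contains it and leaves NaN where no interval matches. The following is a minimal sketch on a small array; the names values, labels and inside are illustrative only, and it assumes the intervals do not overlap.

```python
import numpy as np
import pandas as pd

values = np.arange(10)
intervals = pd.IntervalIndex.from_arrays([4], [8], closed='both')

# Each value gets the interval containing it, or NaN if there is none
labels = pd.cut(values, intervals)

# Keep only the values that received a label
inside = values[pd.notna(labels)]
print(inside.tolist())
```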

CodePudding user response：

You could try numpy broadcasting to create a boolean mask that is True for the values that fall between any pair of rbot and rtop values, then use that mask to select the relevant values:

import numpy as np
arr = np.arange(120)
# The (n_ranges, 1) columns broadcast against arr, giving one row per range;
# any(axis=0) marks a value that lies in at least one range
msk = ((df[['rbot']].to_numpy() <= arr) & (arr <= df[['rtop']].to_numpy())).any(axis=0)
out = arr[msk]
for idx in out:
    print(f'Do some operation with  {idx}')

Output:

Do some operation with  4
Do some operation with  5
Do some operation with  6
Do some operation with  7
Do some operation with  8
Do some operation with  20
Do some operation with  21
Do some operation with  22
Do some operation with  23
Do some operation with  24
Do some operation with  25
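
As a rough illustration of what the broadcasting step produces, the sketch below uses plain arrays instead of the DataFrame. The names per_range and in_any are hypothetical, chosen only for this example.

```python
import numpy as np

rbot = np.array([4, 20])
rtop = np.array([8, 25])
arr = np.arange(30)

# rbot[:, None] has shape (2, 1); comparing it with arr of shape (30,)
# broadcasts to a (2, 30) matrix with one row per range
per_range = (rbot[:, None] <= arr) & (arr <= rtop[:, None])
print(per_range.shape)

# A value qualifies if it lies in at least one range
in_any = per_range.any(axis=0)
print(arr[in_any].tolist())
```

Note that the intermediate matrix holds n_ranges × n_values entries, so its memory use grows with the product of the two sizes.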

CodePudding user response：

There are many ways to approach this; here's one:

import pandas as pd

df = pd.DataFrame(dict(rbot=[4,20],rtop=[8,25]))
# range() excludes its stop value, so shift the upper bounds by one
df['rtop'] += 1
for idx in range(120):
    if any(idx in range(*df.iloc[x]) for x in df.index):
        print(f'Do some operation with  {idx}')

Output:

Do some operation with  4
Do some operation with  5
Do some operation with  6
Do some operation with  7
Do some operation with  8
Do some operation with  20
Do some operation with  21
Do some operation with  22
Do some operation with  23
Do some operation with  24
Do some operation with  25
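
The approach above relies on two properties of range objects: the stop value is excluded, and an integer membership test is computed arithmetically rather than by iterating. A small sketch, with closed_4_to_8 as an illustrative name:

```python
# Membership tests on a range object are computed arithmetically for ints,
# without iterating over the range
closed_4_to_8 = range(4, 8 + 1)  # stop is exclusive, hence the + 1

print(8 in closed_4_to_8)
print(9 in closed_4_to_8)
print(8 in range(4, 8))
```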

CodePudding user response：

FYI, if you only want to perform an operation for each valid index and do not intend to perform any additional aggregations later that would require pandas, this is much faster and more memory efficient:

import pandas as pd

rbot = [i*1000 for i in range(10000)]
rtop = [(i + 1)*1000-2 for i in range(10000)]
main_range = (0, 120)

df=pd.DataFrame(dict(rbot=[4,20],rtop=[8,25]))

intervals = zip(df['rbot'], df['rtop'])
for i in intervals:
    # Clamp each subrange to the main range; + 1 because range() excludes its stop
    overlap = range(max(main_range[0], i[0]), min(main_range[1], i[-1]) + 1)
    for idx in overlap:
        print(f'Do some operation with  {idx}')

Just compute the overlap of the main range with the subranges.

Do some operation with  4
Do some operation with  5
Do some operation with  6
Do some operation with  7
Do some operation with  8
Do some operation with  20
Do some operation with  21
Do some operation with  22
Do some operation with  23
Do some operation with  24
Do some operation with  25
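
The same overlap idea can be wrapped in a small generator. This is only a sketch: iter_in_ranges is a hypothetical helper, it treats stop as exclusive (like range(120) in the question), and it assumes the pairs do not overlap, since overlapping pairs would yield the same integer more than once.

```python
def iter_in_ranges(start, stop, intervals):
    """Yield each integer in [start, stop) that lies in an inclusive (bot, top) pair."""
    for bot, top in intervals:
        lo = max(start, bot)
        hi = min(stop - 1, top)
        # If the pair lies outside [start, stop), lo > hi and the range is empty
        yield from range(lo, hi + 1)

hits = list(iter_in_ranges(0, 120, [(4, 8), (20, 25)]))
print(hits)
```

The work done is proportional to the number of pairs plus the number of matching integers, rather than to the full length of the counter range.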

Runtimes with a larger dataset:

import pandas as pd
import numpy as np

rbot = [i*1000 for i in range(10000)]
rtop = [(i + 1)*1000-2 for i in range(10000)]
main_range = (0, 120000)

df = pd.DataFrame({'rbot': rbot, 'rtop': rtop})

def python():
    intervals = zip(df['rbot'], df['rtop'])
    for i in intervals:
        overlap = range(max(main_range[0], i[0]), min(main_range[1], i[-1]) + 1)
        for idx in overlap:
            pass  # print(idx)

# 5.03 ms ± 58 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit python()

def pandas():
    arr = np.arange(*main_range)
    
    intervals = pd.IntervalIndex.from_arrays(df.rbot, df.rtop, closed='both')

    out = pd.cut(arr, intervals)

    out = arr[pd.notna(out)]
    
    for idx in out:
        pass#print(idx)

# 67 ms ± 467 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit pandas()


def numpy():
    arr = np.arange(*main_range)
    msk = ((df[['rbot']].to_numpy() <= arr) & (arr <= df[['rtop']].to_numpy())).sum(axis=0)
    out = np.flatnonzero(msk*arr)
    for idx in out:
        pass#print(idx)

# 2.77 s ± 7.48 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)    
%timeit numpy()
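
One likely reason the broadcasting version is the slowest here: with 10000 ranges and 120000 values, each comparison builds a boolean matrix of 10000 × 120000 = 1.2 billion entries. If the ranges are sorted by lower bound and do not overlap (both assumptions, not stated in the question), a binary search per value avoids that matrix. The sketch below uses np.searchsorted on the small example; it is an alternative not taken from the answers above, and it was not part of the timings.

```python
import numpy as np

rbot = np.array([4, 20])   # must be sorted ascending
rtop = np.array([8, 25])   # rtop[i] pairs with rbot[i]; ranges must not overlap
arr = np.arange(120)

# Index of the last range whose lower bound is <= each value (-1 if none)
pos = np.searchsorted(rbot, arr, side='right') - 1

# Clamp -1 to 0 so the lookup is valid, then mask those entries out again
inside = (pos >= 0) & (arr <= rtop[np.maximum(pos, 0)])
out = arr[inside]
print(out.tolist())
```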