I am trying to come up with a way to filter a dataframe so that it contains only the ranges of numbers needed for further processing. Below is an example dataframe:
import pandas as pd

data_sample = [['part1', 234], ['part2', 224], ['part3', 214], ['part4', 114],
               ['part5', 1111], ['part6', 1067], ['part7', 1034], ['part8', 1457],
               ['part9', 789], ['part10', 1367], ['part11', 467], ['part12', 367]]
data_df = pd.DataFrame(data_sample, columns=['partname', 'sbin'])
data_df['sbin'] = pd.to_numeric(data_df['sbin'], errors='coerce', downcast='integer')
With the above dataframe, I want to filter out any part whose sbin falls in the ranges [200-230], [1000-1150], [350-370], or [100-130].
I have a bigger dataframe with many more ranges to remove, so I need a faster way than the command below:
data_df.loc[~(((data_df.sbin >= 200) & (data_df.sbin <= 230)) |
              ((data_df.sbin >= 100) & (data_df.sbin <= 130)) |
              ((data_df.sbin >= 350) & (data_df.sbin <= 370)) |
              ((data_df.sbin >= 1000) & (data_df.sbin <= 1150)))]
which produces the output below:
partname sbin
0 part1 234
7 part8 1457
8 part9 789
9 part10 1367
10 part11 467
The above method requires a lot of conditions and takes a long time. I would like to know if there is a better way, using regex or some other Python approach that I am not aware of.
Any help would be great.
CodePudding user response:
pd.cut works fine here, especially as your intervals are not overlapping:
intervals = pd.IntervalIndex.from_tuples([(200, 230), (1000, 1150), (350, 370), (100, 130)])
# if the values do not fall within the intervals, it is a null
# hence the isna check to keep only the null matches
# thanks to @corralien for the include_lowest=True suggestion
data_df.loc[pd.cut(data_df.sbin, intervals, include_lowest=True).isna()]
partname sbin
0 part1 234
7 part8 1457
8 part9 789
9 part10 1367
10 part11 467
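One caveat worth noting, as an addition to this answer: pd.IntervalIndex.from_tuples builds right-closed intervals by default, so a value equal to a left boundary (e.g. 200) would not be matched. If the ranges are meant to be inclusive on both ends, a minimal variant is to pass closed='both', assuming your pandas version accepts both-closed, non-overlapping bins in pd.cut:

intervals = pd.IntervalIndex.from_tuples(
    [(200, 230), (1000, 1150), (350, 370), (100, 130)], closed='both'
)
# values outside every interval become NaN, and isna() keeps exactly those rows
data_df.loc[pd.cut(data_df.sbin, intervals).isna()]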
CodePudding user response:
New version
intervals = [(100, 130), (200, 230), (350, 370), (1000, 1150)]
m = np.any([np.logical_and(data_df['sbin'] >= l, data_df['sbin'] <= u)
            for l, u in intervals], axis=0)
out = data_df.loc[np.where(~m)[0]]
Old version
This does not work with float numbers as-is, since np.arange only enumerates the integers in each interval (see the illustration after the output below).
Use np.where and np.in1d:
intervals = [(100, 130), (200, 230), (350, 370), (1000, 1150)]
m = np.hstack([np.arange(l, u + 1) for l, u in intervals])
out = data_df.loc[np.where(~np.in1d(data_df['sbin'], m))[0]]
>>> out
partname sbin
0 part1 234
7 part8 1457
8 part9 789
9 part10 1367
10 part11 467
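To see why the old version fails on floats: np.in1d only matches the integer grid produced by np.arange, so a value such as 225.5 lies inside [200, 230] but survives the filter. A minimal illustration, with example float values of my own:

import numpy as np
import pandas as pd

grid = np.hstack([np.arange(l, u + 1) for l, u in [(200, 230)]])
float_df = pd.DataFrame({'sbin': [225.5, 234.0]})
# 225.5 is inside the interval but not on the integer grid, so it is kept by mistake
print(float_df.loc[~np.in1d(float_df['sbin'], grid)])
#     sbin
# 0  225.5
# 1  234.0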
Performance: for 100k records:
data_df = pd.DataFrame({'sbin': np.random.randint(0, 2000, 100000)})
def exclude_range_danimesejo():
    intervals = sorted([(200, 230), (1000, 1150), (350, 370), (100, 130)])
    intervals = np.array(intervals).flatten()
    mask = np.searchsorted(intervals, data_df['sbin']) % 2 == 0
    return data_df.loc[mask]

def exclude_range_sammywemmy():
    intervals = pd.IntervalIndex.from_tuples([(200, 230), (1000, 1150), (350, 370), (100, 130)])
    return data_df.loc[pd.cut(data_df.sbin, intervals).isna()]

def exclude_range_corralien():
    intervals = [(100, 130), (200, 230), (350, 370), (1000, 1150)]
    m = np.hstack([np.arange(l, u + 1) for l, u in intervals])
    return data_df.loc[np.where(~np.in1d(data_df['sbin'], m))[0]]

def exclude_range_corralien2():
    intervals = [(100, 130), (200, 230), (350, 370), (1000, 1150)]
    m = np.any([np.logical_and(data_df['sbin'] >= l, data_df['sbin'] <= u)
                for l, u in intervals], axis=0)
    return data_df.loc[np.where(~m)[0]]
>>> %timeit exclude_range_danimesejo()
2.62 ms ± 23.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit exclude_range_sammywemmy()
63.6 ms ± 549 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit exclude_range_corralien()
8.75 ms ± 94.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit exclude_range_corralien2()
4.3 ms ± 152 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
CodePudding user response:
For non-overlapping intervals, you could use np.searchsorted:
# sort the non overlapping intervals
intervals = sorted([(200, 230), (1000, 1150), (350, 370), (100, 130)])
# flatten
intervals = np.array(intervals).flatten()
# an even insertion index means the value falls outside every interval
# (a value equal to a left edge also gets an even index, hence the in1d check)
mask = (np.searchsorted(intervals, data_df['sbin']) % 2 == 0) & ~np.in1d(data_df['sbin'], intervals[::2])
print(data_df.loc[mask])
Output
partname sbin
0 part1 234
7 part8 1457
8 part9 789
9 part10 1367
10 part11 467
Timings (on a 3.1 GHz Intel Core i7)
%timeit exclude_range_danimesejo()
3.45 ms ± 13.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Setup (in addition to the one by @Corralien)
def exclude_range_danimesejo():
    intervals = sorted([(200, 230), (1000, 1150), (350, 370), (100, 130)])
    intervals = np.array(intervals).flatten()
    mask = (np.searchsorted(intervals, data_df['sbin']) % 2 == 0) & ~np.in1d(data_df['sbin'], intervals[::2])
    return data_df.loc[mask]
The overall complexity of this approach is O((n + m) log n), where n is the number of intervals and m is the length of the sbin column.
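A quick worked example of the even/odd trick, added as an illustration: with the flattened, sorted edges, np.searchsorted returns the insertion index for each value, and an odd index means the value landed between a left edge and a right edge, i.e. inside an interval:

import numpy as np

edges = np.array([100, 130, 200, 230, 350, 370, 1000, 1150])
# 234 -> 4 (even: outside), 224 -> 3 (odd: inside [200, 230]),
# 200 -> 2 (even, but equal to a left edge, hence the extra in1d check)
print(np.searchsorted(edges, [234, 224, 200]))  # [4 3 2]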
CodePudding user response:
What about using a loop to apply your interval conditions?
# this builds a boolean Series that is True where sbin falls in
# any of the intervals in the list and False otherwise
is_in_intervals = False
intervals = [(200, 230), (1000, 1150), (350, 370), (100, 130)]
for interval in intervals:
    is_in_intervals |= (data_df.sbin >= interval[0]) & (data_df.sbin <= interval[1])
data_df[~is_in_intervals]
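If this comes up often, the loop can be wrapped in a small reusable helper. A sketch of my own (exclude_ranges is a hypothetical name), using Series.between, which is inclusive on both ends by default:

import pandas as pd

def exclude_ranges(df, column, intervals):
    # drop rows whose column value falls inside any (low, high) interval, inclusive
    is_in_intervals = pd.Series(False, index=df.index)
    for low, high in intervals:
        is_in_intervals |= df[column].between(low, high)
    return df[~is_in_intervals]

out = exclude_ranges(data_df, 'sbin', [(200, 230), (1000, 1150), (350, 370), (100, 130)])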