I have very long time series for which I need to set values within intervals around certain events to np.nan. measures is a datetime-indexed dataframe and events is a distinct DatetimeIndex.
Measures looks like:
| index | measure |
|---------------------|----------|
| 1970-01-01 00:00:15 | 0.471331 |
| 1970-01-01 00:02:37 | 0.069177 |
| 1970-01-01 00:03:59 | 0.955357 |
| 1970-01-01 00:06:17 | 0.107815 |
| 1970-01-01 00:06:24 | 0.046558 |
| 1970-01-01 00:06:25 | 0.056558 |
| 1970-01-01 00:08:12 | 0.837405 |
For example, if there were only a single event at timestamp 1970-01-01 00:06:21 and the interval for removing values were +/- 5 seconds, the output would be:
| index | measure |
|---------------------|----------|
| 1970-01-01 00:00:15 | 0.471331 |
| 1970-01-01 00:02:37 | 0.069177 |
| 1970-01-01 00:03:59 | 0.955357 |
| 1970-01-01 00:06:17 | np.nan |
| 1970-01-01 00:06:24 | np.nan |
| 1970-01-01 00:06:25 | np.nan |
| 1970-01-01 00:08:12 | 0.837405 |
Currently I'm iterating over the events using .loc:
for i in range(events.shape[0]):
    measures.loc[events[i] - pd.Timedelta("4min"):
                 events[i] + pd.Timedelta("1min")] = np.nan
Now this works, but it takes too long because both dataframes are large (events: 10k rows, measures: 1.5m rows). For the same reason I couldn't construct a boolean index like so:
measure_index = measures.index.to_numpy()
left_bounds = (events - pd.Timedelta("4min")).to_numpy()
right_bounds = (events + pd.Timedelta("1min")).to_numpy()
# The following (n_events, n_measures) product wouldn't fit in memory even with boolean dtype.
left_bool_array = measure_index >= left_bounds.reshape((-1, 1))
right_bool_array = measure_index <= right_bounds.reshape((-1, 1))
mask = np.any(left_bool_array & right_bool_array, axis=0)
Left joining the events on measures or reindexing events is also out of the question as they take too long.
I then ran into pd.IntervalIndex:
left_bound = events - pd.Timedelta("4min")
right_bound = events + pd.Timedelta("1min")
interval_index = pd.IntervalIndex.from_arrays(left_bound, right_bound)
IntervalIndex has a .contains() method which takes a scalar and returns "a boolean mask whether the value is contained in the Intervals". However, for my use case I'd need to loop through the measures frame and sum the boolean array for each row. I'm looking for a method like:
pandas.IntervalIndex.intersect(input: array_like) -> boolean_array (same shape as input)
With each element in the output representing whether the corresponding input value is in any of the intervals.
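For reference, the array-in/boolean-array-out behaviour described here can be approximated without an IntervalIndex by sorting and merging the event intervals and then using np.searchsorted. This is only a sketch: `in_any_interval` is a made-up helper (not a pandas API) and it assumes at least one interval is given.

```python
import numpy as np


def in_any_interval(values, left_bounds, right_bounds):
    """Boolean array: True where `values` falls in any closed [left, right] interval.

    Assumes len(left_bounds) == len(right_bounds) >= 1; bounds need not be sorted.
    """
    order = np.argsort(left_bounds)
    lefts, rights = left_bounds[order], right_bounds[order]
    # Merge overlapping intervals so the remaining intervals are disjoint and sorted.
    merged_l, merged_r = [lefts[0]], [rights[0]]
    for l, r in zip(lefts[1:], rights[1:]):
        if l <= merged_r[-1]:
            merged_r[-1] = max(merged_r[-1], r)
        else:
            merged_l.append(l)
            merged_r.append(r)
    merged_l, merged_r = np.asarray(merged_l), np.asarray(merged_r)
    # A value lies inside an interval iff it has passed more left bounds
    # than right bounds.
    starts = np.searchsorted(merged_l, values, side="right")
    ends = np.searchsorted(merged_r, values, side="left")
    return starts > ends
```

With datetime64 bounds this could then be used as something like `measures.loc[in_any_interval(measures.index.to_numpy(), left_bounds, right_bounds)] = np.nan`.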
Similar but different questions:
- Interval lookup with interval index: "Fastest way to merge pandas dataframe on ranges"
- Quite similar, but the suggested solutions (merges) are not applicable: "Match IntervalIndex as part of a MultiIndex"
- If only I had the same indexes and a single interval per row to look up: "Best way to join / merge by range in pandas"
Edit: performance of the options discussed in the answers below, with len(events) = 10000 and len(measures) = 1525229:
- pandas .loc: 10.5 seconds

for _ in range(10):
    left_bound = dilution_copy.index - pd.Timedelta("4min")
    right_bound = dilution_copy.index + pd.Timedelta("1min")
    for left, right in zip(left_bound, right_bound):
        measure_copy.loc[left:right] = np.nan
- Staircase: 13.9 seconds

for _ in range(10):
    sf = sc.Stairs(start=measure_copy.index, end=measure_copy.index[1:], value=measure_copy.values)
    mask = sc.Stairs(start=dilution_copy.index - pd.Timedelta('4 min'), end=dilution_copy.index + pd.Timedelta('1 min'))
    masked = sf.mask(mask)
    result = masked.sample(measure_copy.index, include_index=True)
- Bisect with .iloc: 35.1 seconds

for _ in range(10):
    left_bound = dilution_copy.index - pd.Timedelta("4min")
    right_bound = dilution_copy.index + pd.Timedelta("1min")
    for left, right in zip(left_bound, right_bound):
        measure_copy.iloc[bisect(measure_copy.index, left):bisect(measure_copy.index, right)] = np.nan
CodePudding user response:
If your measures data is already sorted (or if it is not too time-consuming to sort it once), you can think of using bisect.
Here is an approximate, more complete solution:
- Check every element of events for where it can be "inserted" in the measures index
- Check if the timestamps on either side of this "insertion point" are within 5 seconds
- If yes, set them to nan
def bisect_loop():
    for event in events:
        bisect_point = bisect.bisect(measures.index, event)
        # walk left from the insertion point while timestamps are within 5 seconds
        i = bisect_point - 1
        while i >= 0 and (event - measures.index[i]).total_seconds() < 5:
            measures.loc[measures.index[i]] = np.nan
            i -= 1
        # walk right from the insertion point while timestamps are within 5 seconds
        i = bisect_point
        while i < len(measures.index) and (measures.index[i] - event).total_seconds() < 5:
            measures.loc[measures.index[i]] = np.nan
            i += 1
Here are some stats on a dummy dataset with 150 measures and 10 events:
df = pd.DataFrame({'year': [2000]*150, 'month': [2]*150, 'day': [12]*150, 'hour': np.random.choice(range(1), 150), 'minute': np.random.choice(range(60), 150), 'second': np.random.choice(range(60), 150)})
timestamps = pd.to_datetime(df)
measures = pd.concat([timestamps, pd.Series(np.random.rand(150))], axis=1)
measures = measures.set_index(0)
measures = measures.sort_index()
df = pd.DataFrame({'year': [2000]*150, 'month': [2]*150, 'day': [12]*150, 'hour': np.random.choice(range(24), 150), 'minute': np.random.choice(range(60), 150), 'second': np.random.choice(range(60), 150)})
events = pd.to_datetime(df).sample(10).reset_index(drop=True)
%timeit op_loop() # This is your loc based approach that is working
8.74 ms ± 126 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit bisect_loop()
3.22 ms ± 45.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
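As an aside, the per-event walk above can also be fully vectorised: two np.searchsorted calls over the event bounds plus a difference-array trick yield the whole mask in one pass. A sketch under the same sorted-index assumption; `mask_ranges` is a made-up helper, not part of any library:

```python
import numpy as np


def mask_ranges(index_values, lefts, rights):
    """Boolean array: True where sorted `index_values` falls in any [left, right]."""
    # For each interval, locate the first and one-past-last positions it covers.
    starts = np.searchsorted(index_values, lefts, side="left")
    ends = np.searchsorted(index_values, rights, side="right")
    # Difference array: +1 where a covered run begins, -1 where it ends;
    # the running sum is positive exactly inside covered runs.
    diff = np.zeros(len(index_values) + 1, dtype=int)
    np.add.at(diff, starts, 1)
    np.add.at(diff, ends, -1)
    return np.cumsum(diff[:-1]) > 0
```

The mask could then be applied in one shot, e.g. `measures[mask_ranges(measures.index.to_numpy(), left_bound.to_numpy(), right_bound.to_numpy())] = np.nan`, avoiding the Python-level loop entirely.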
CodePudding user response:
This can be solved with staircase, a package built on pandas and numpy for working with (mathematical) step functions.
Setup:
measures = pd.Series(
[0.471331, 0.069177, 0.955357, 0.107815, 0.046558, 0.056558, 0.837405],
index = pd.DatetimeIndex([
pd.Timestamp("1970-1-1 00:00:15"),
pd.Timestamp("1970-1-1 00:02:37"),
pd.Timestamp("1970-1-1 00:03:59"),
pd.Timestamp("1970-1-1 00:06:17"),
pd.Timestamp("1970-1-1 00:06:24"),
pd.Timestamp("1970-1-1 00:06:25"),
pd.Timestamp("1970-1-1 00:08:12"),
])
)
events = pd.DatetimeIndex(['1970-01-01 00:06:21'])
Solution:
import pandas as pd
import staircase as sc
sf = sc.Stairs(start=measures.index, end = measures.index[1:], value=measures.values)
mask = sc.Stairs(start=events - pd.Timedelta('5 seconds'), end=events + pd.Timedelta('5 seconds'))
masked = sf.mask(mask)
result = masked.sample(measures.index, include_index=True)
Why it works
1st line: create a step function which is made up of intervals whose endpoints are the index of measures
. The last interval, which starts at 1970-01-01 00:08:12, has no end point and will be of infinite length
2nd line: create a step function where the times in your events variable are the centers of intervals in the step function, and the endpoints are +/- 5 seconds from the center. No problem at all if any intervals overlap.
3rd line: mask the first step function with the second step function, which sets values in the first step function to NaN whenever the second step function is non-zero
4th line: evaluate the masked step function at the timestamps of measures
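For the single-event example from the question, the staircase result can be cross-checked with a plain boolean mask. This per-event comparison is fine here but is exactly the pattern the question needs to avoid at 10k events:

```python
import numpy as np
import pandas as pd

measures = pd.Series(
    [0.471331, 0.069177, 0.955357, 0.107815, 0.046558, 0.056558, 0.837405],
    index=pd.DatetimeIndex([
        "1970-01-01 00:00:15", "1970-01-01 00:02:37", "1970-01-01 00:03:59",
        "1970-01-01 00:06:17", "1970-01-01 00:06:24", "1970-01-01 00:06:25",
        "1970-01-01 00:08:12",
    ]),
)
event = pd.Timestamp("1970-01-01 00:06:21")

# True inside the closed window [event - 5s, event + 5s]
inside = (measures.index >= event - pd.Timedelta("5s")) & (
    measures.index <= event + pd.Timedelta("5s")
)
# 00:06:17, 00:06:24 and 00:06:25 become NaN, as in the expected output table
expected = measures.mask(inside)
```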