I have very long time series for which I need to set values within intervals around certain events to np.nan. measures is a datetime-indexed dataframe and events is a distinct DatetimeIndex.
Measures looks like:
| index | measure |
|---------------------|----------|
| 1970-01-01 00:00:15 | 0.471331 |
| 1970-01-01 00:02:37 | 0.069177 |
| 1970-01-01 00:03:59 | 0.955357 |
| 1970-01-01 00:06:17 | 0.107815 |
| 1970-01-01 00:06:24 | 0.046558 |
| 1970-01-01 00:06:25 | 0.056558 |
| 1970-01-01 00:08:12 | 0.837405 |
For example, if there were only a single event at timestamp 1970-01-01 00:06:21 and the interval for removing values were +/- 5 seconds, the output would be:
| index | measure |
|---------------------|----------|
| 1970-01-01 00:00:15 | 0.471331 |
| 1970-01-01 00:02:37 | 0.069177 |
| 1970-01-01 00:03:59 | 0.955357 |
| 1970-01-01 00:06:17 | np.nan |
| 1970-01-01 00:06:24 | np.nan |
| 1970-01-01 00:06:25 | np.nan |
| 1970-01-01 00:08:12 | 0.837405 |
Currently I'm iterating over the events using .loc:
for i in range(events.shape[0]):
    measures.loc[events[i] - pd.Timedelta("4min"):
                 events[i] + pd.Timedelta("1min")] = np.nan
Now this works, but it takes too long because both dataframes are large (events: 10k rows, measures: 1.5m rows). For the same reason I couldn't construct a boolean index like so:
measure_index = measures.index.to_numpy()
left_bounds = (events - pd.Timedelta("4min")).to_numpy()
right_bounds = (events + pd.Timedelta("1min")).to_numpy()
# The following (n_events, n_measures) product wouldn't fit in memory even with boolean dtype.
left_bool_array = measure_index >= left_bounds.reshape((-1, 1))
right_bool_array = measure_index <= right_bounds.reshape((-1, 1))
mask = np.any(left_bool_array & right_bool_array, axis=0)
Left joining the events on measures or reindexing events is also out of the question as they take too long.
I then ran into pd.IntervalIndex:
left_bound = events - pd.Timedelta("4min")
right_bound = events + pd.Timedelta("1min")
interval_index = pd.IntervalIndex.from_arrays(left_bound, right_bound)
IntervalIndex has a .contains() method which takes a scalar and returns "a boolean mask whether the value is contained in the Intervals". However, for my use case I'd need to loop through the measures frame and sum the boolean array for each row. I'm looking for a method like:
pandas.IntervalIndex.intersect(input: array_like) -> boolean_array (same shape as input)
With each element in the output representing whether the corresponding input value is in any of the intervals.
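For reference, the array-in/boolean-array-out behaviour described here can be approximated without an IntervalIndex by sorting and merging the event intervals and then using np.searchsorted. This is only a sketch: `in_any_interval` is a made-up helper (not a pandas API) and it assumes at least one interval is given.

```python
import numpy as np


def in_any_interval(values, left_bounds, right_bounds):
    """Boolean array: True where `values` falls in any closed [left, right] interval.

    Assumes len(left_bounds) == len(right_bounds) >= 1; bounds need not be sorted.
    """
    order = np.argsort(left_bounds)
    lefts, rights = left_bounds[order], right_bounds[order]
    # Merge overlapping intervals so the remaining intervals are disjoint and sorted.
    merged_l, merged_r = [lefts[0]], [rights[0]]
    for l, r in zip(lefts[1:], rights[1:]):
        if l <= merged_r[-1]:
            merged_r[-1] = max(merged_r[-1], r)
        else:
            merged_l.append(l)
            merged_r.append(r)
    merged_l, merged_r = np.asarray(merged_l), np.asarray(merged_r)
    # A value lies inside an interval iff it has passed more left bounds
    # than right bounds.
    starts = np.searchsorted(merged_l, values, side="right")
    ends = np.searchsorted(merged_r, values, side="left")
    return starts > ends
```

With datetime64 bounds this could then be used as something like `measures.loc[in_any_interval(measures.index.to_numpy(), left_bounds, right_bounds)] = np.nan`.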
Similar but different questions:
- Interval lookup with interval index: "Fastest way to merge pandas dataframe on ranges"
- Quite similar, but the suggested solutions (merges) are not applicable: "Match IntervalIndex as part of a MultiIndex"
- If only I had the same indexes and a single interval per row to look up: "Best way to join / merge by range in pandas"
Edit: performance of the options discussed in the answers below, with len(events) = 10000 and len(measures) = 1525229:
- pandas .loc: 10.5 seconds

for _ in range(10):
    left_bound = dilution_copy.index - pd.Timedelta("4min")
    right_bound = dilution_copy.index + pd.Timedelta("1min")
    for left, right in zip(left_bound, right_bound):
        measure_copy.loc[left:right] = np.nan
- Staircase: 13.9 seconds

for _ in range(10):
    sf = sc.Stairs(start=measure_copy.index, end=measure_copy.index[1:], value=measure_copy.values)
    mask = sc.Stairs(start=dilution_copy.index - pd.Timedelta('4 min'), end=dilution_copy.index + pd.Timedelta('1 min'))
    masked = sf.mask(mask)
    result = masked.sample(measure_copy.index, include_index=True)
- Bisect with .iloc: 35.1 seconds

for _ in range(10):
    left_bound = dilution_copy.index - pd.Timedelta("4min")
    right_bound = dilution_copy.index + pd.Timedelta("1min")
    for left, right in zip(left_bound, right_bound):
        measure_copy.iloc[bisect(measure_copy.index, left):bisect(measure_copy.index, right)] = np.nan
CodePudding user response:
If your measures data is already sorted (or if it is not too time-consuming to sort it once), you can think of using bisect.
Here is an approximate, more complete solution:
- Check every element of events for where it can be "inserted" in the measures index
- Check if the timestamps on either side of this "insertion point" are within 5 seconds
- If yes, set them to nan
def bisect_loop():
    for event in events:
        bisect_point = bisect.bisect(measures.index, event)
        # walk left from the insertion point while timestamps are within 5 seconds
        i = bisect_point - 1
        while i >= 0 and (event - measures.index[i]).total_seconds() < 5:
            measures.loc[measures.index[i]] = np.nan
            i -= 1
        # walk right from the insertion point while timestamps are within 5 seconds
        i = bisect_point
        while i < len(measures.index) and (measures.index[i] - event).total_seconds() < 5:
            measures.loc[measures.index[i]] = np.nan
            i += 1
Here are some stats on a dummy dataset with 150 measures and 10 events:
df = pd.DataFrame({'year': [2000]*150, 'month': [2]*150, 'day': [12]*150, 'hour': np.random.choice(range(1), 150), 'minute': np.random.choice(range(60), 150), 'second': np.random.choice(range(60), 150)})
timestamps = pd.to_datetime(df)
measures = pd.concat([timestamps, pd.Series(np.random.rand(150))], axis=1)
measures = measures.set_index(0)
measures = measures.sort_index()
df = pd.DataFrame({'year': [2000]*150, 'month': [2]*150, 'day': [12]*150, 'hour': np.random.choice(range(24), 150), 'minute': np.random.choice(range(60), 150), 'second': np.random.choice(range(60), 150)})
events = pd.to_datetime(df).sample(10).reset_index(drop=True)
%timeit op_loop() # This is your loc based approach that is working
8.74 ms ± 126 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit bisect_loop()
3.22 ms ± 45.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
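As an aside, the per-event walk above can also be fully vectorised: two np.searchsorted calls over the event bounds plus a difference-array trick yield the whole mask in one pass. A sketch under the same sorted-index assumption; `mask_ranges` is a made-up helper, not part of any library:

```python
import numpy as np


def mask_ranges(index_values, lefts, rights):
    """Boolean array: True where sorted `index_values` falls in any [left, right]."""
    # For each interval, locate the first and one-past-last positions it covers.
    starts = np.searchsorted(index_values, lefts, side="left")
    ends = np.searchsorted(index_values, rights, side="right")
    # Difference array: +1 where a covered run begins, -1 where it ends;
    # the running sum is positive exactly inside covered runs.
    diff = np.zeros(len(index_values) + 1, dtype=int)
    np.add.at(diff, starts, 1)
    np.add.at(diff, ends, -1)
    return np.cumsum(diff[:-1]) > 0
```

The mask could then be applied in one shot, e.g. `measures[mask_ranges(measures.index.to_numpy(), left_bound.to_numpy(), right_bound.to_numpy())] = np.nan`, avoiding the Python-level loop entirely.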
CodePudding user response:
This can be solved with staircase, a package built on pandas and numpy for working with (mathematical) step functions.
Setup:
measures = pd.Series(
[0.471331, 0.069177, 0.955357, 0.107815, 0.046558, 0.056558, 0.837405],
index = pd.DatetimeIndex([
pd.Timestamp("1970-1-1 00:00:15"),
pd.Timestamp("1970-1-1 00:02:37"),
pd.Timestamp("1970-1-1 00:03:59"),
pd.Timestamp("1970-1-1 00:06:17"),
pd.Timestamp("1970-1-1 00:06:24"),
pd.Timestamp("1970-1-1 00:06:25"),
pd.Timestamp("1970-1-1 00:08:12"),
])
)
events = pd.DatetimeIndex(['1970-01-01 00:06:21'])
Solution:
import pandas as pd
import staircase as sc
sf = sc.Stairs(start=measures.index, end = measures.index[1:], value=measures.values)
mask = sc.Stairs(start=events - pd.Timedelta('5 seconds'), end=events + pd.Timedelta('5 seconds'))
masked = sf.mask(mask)
result = masked.sample(measures.index, include_index=True)
Why it works
1st line: create a step function which is made up of intervals whose endpoints are the index of measures
. The last interval, which starts at 1970-01-01 00:08:12, has no end point and will be of infinite length
2nd line: create a step function where the times in your events variable are the centers of intervals in the step function, and the endpoints are +/- 5 seconds from the center. No problem at all if any intervals overlap.
3rd line: mask the first step function with the second step function, which sets values in the first step function to NaN whenever the second step function is non-zero
4th line: evaluate the masked step function at the timestamps of measures
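For the single-event example from the question, the staircase result can be cross-checked with a plain boolean mask. This per-event comparison is fine here but is exactly the pattern the question needs to avoid at 10k events:

```python
import numpy as np
import pandas as pd

measures = pd.Series(
    [0.471331, 0.069177, 0.955357, 0.107815, 0.046558, 0.056558, 0.837405],
    index=pd.DatetimeIndex([
        "1970-01-01 00:00:15", "1970-01-01 00:02:37", "1970-01-01 00:03:59",
        "1970-01-01 00:06:17", "1970-01-01 00:06:24", "1970-01-01 00:06:25",
        "1970-01-01 00:08:12",
    ]),
)
event = pd.Timestamp("1970-01-01 00:06:21")

# True inside the closed window [event - 5s, event + 5s]
inside = (measures.index >= event - pd.Timedelta("5s")) & (
    measures.index <= event + pd.Timedelta("5s")
)
# 00:06:17, 00:06:24 and 00:06:25 become NaN, as in the expected output table
expected = measures.mask(inside)
```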