Pandas : Datetimeindex and Intervalindex intersection

I have very long time series for which I need to set values to np.nan within intervals around certain events. measures is a datetime-indexed dataframe and events is a separate DatetimeIndex.

Measures looks like:

| index               | measure  |
|---------------------|----------|
| 1970-01-01 00:00:15 | 0.471331 |
| 1970-01-01 00:02:37 | 0.069177 |
| 1970-01-01 00:03:59 | 0.955357 |
| 1970-01-01 00:06:17 | 0.107815 |
| 1970-01-01 00:06:24 | 0.046558 |
| 1970-01-01 00:06:25 | 0.056558 |
| 1970-01-01 00:08:12 | 0.837405 |

For example, if there were only a single event at timestamp 1970-01-01 00:06:21 and the interval for removing values was +/- 5 seconds, the output would be:

| index               | measure  |
|---------------------|----------|
| 1970-01-01 00:00:15 | 0.471331 |
| 1970-01-01 00:02:37 | 0.069177 |
| 1970-01-01 00:03:59 | 0.955357 |
| 1970-01-01 00:06:17 | np.nan   |
| 1970-01-01 00:06:24 | np.nan   |
| 1970-01-01 00:06:25 | np.nan   |
| 1970-01-01 00:08:12 | 0.837405 |

Currently I'm iterating over the events using .loc:

for i in range(events.shape[0]):
    measures.loc[events[i] - pd.Timedelta("4min"):
                 events[i] + pd.Timedelta("1min")] = np.nan

Now this works, but it takes too long because both dataframes are large (events: 10k rows, measures: 1.5m rows). For the same reason I couldn't construct a boolean index like so:

measure_index = measures.index.to_numpy()
left_bounds = (events - pd.Timedelta("4min")).to_numpy()
right_bounds = (events + pd.Timedelta("1min")).to_numpy()
# The following product wouldn't fit in memory even with boolean dtype.
left_bool_array = measure_index >= left_bounds.reshape((-1, 1))
right_bool_array = measure_index <= right_bounds.reshape((-1, 1))
mask = np.sum(left_bool_array & right_bool_array, axis=0)

Left joining the events on measures or reindexing events is also out of the question as they take too long.
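For reference, the same windowing can be vectorised without the full outer product: `np.searchsorted` on the sorted measures index finds each window's first and last row with a binary search, and a difference array marks the covered rows. A sketch using the example data above, assuming the index is sorted:

```python
import numpy as np
import pandas as pd

measures = pd.Series(
    [0.471331, 0.069177, 0.955357, 0.107815, 0.046558, 0.056558, 0.837405],
    index=pd.DatetimeIndex([
        "1970-01-01 00:00:15", "1970-01-01 00:02:37", "1970-01-01 00:03:59",
        "1970-01-01 00:06:17", "1970-01-01 00:06:24", "1970-01-01 00:06:25",
        "1970-01-01 00:08:12",
    ]),
)
events = pd.DatetimeIndex(["1970-01-01 00:06:21"])

idx = measures.index.values
starts = np.searchsorted(idx, (events - pd.Timedelta("5s")).values)               # first row in each window
stops = np.searchsorted(idx, (events + pd.Timedelta("5s")).values, side="right")  # first row past each window

# Difference array: +1 where a window opens, -1 where it closes; the running
# sum counts how many windows cover each row.
delta = np.zeros(len(idx) + 1, dtype=np.int64)
np.add.at(delta, starts, 1)
np.add.at(delta, stops, -1)
covered = np.cumsum(delta[:-1]) > 0

measures[covered] = np.nan
```

This touches each event once with two binary searches and only allocates O(len(measures)) extra memory, so it scales to the 10k/1.5m sizes in the question.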

I then ran into pd.IntervalIndex:

left_bound = events - pd.Timedelta("4min")
right_bound = events + pd.Timedelta("1min")
interval_index = pd.IntervalIndex.from_arrays(left_bound, right_bound)

IntervalIndex has a .contains() method which takes a scalar and returns "a boolean mask whether the value is contained in the Intervals". However, for my use case I'd need to loop through the measures frame and sum the boolean array for each row. I'm looking for a method like:

pandas.IntervalIndex.intersect(input: array_like) -> boolean_array (same shape as input)

With each element in the output representing whether the corresponding input value is in any of the intervals.
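When the event windows do not overlap, `IntervalIndex.get_indexer` comes close to this hypothetical method: for each input value it returns the position of the interval containing it, or -1 if none does, so comparing against -1 yields exactly that boolean array. A sketch with the +/- 5 second example above (note that get_indexer raises an error if the intervals overlap):

```python
import pandas as pd

events = pd.DatetimeIndex(["1970-01-01 00:06:21"])
measure_index = pd.DatetimeIndex([
    "1970-01-01 00:00:15", "1970-01-01 00:02:37", "1970-01-01 00:03:59",
    "1970-01-01 00:06:17", "1970-01-01 00:06:24", "1970-01-01 00:06:25",
    "1970-01-01 00:08:12",
])

interval_index = pd.IntervalIndex.from_arrays(
    events - pd.Timedelta("5s"), events + pd.Timedelta("5s"), closed="both"
)
# get_indexer maps each timestamp to its containing interval, or -1 if none.
in_any_interval = interval_index.get_indexer(measure_index) != -1
```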


Edit: performance of the options discussed in the answers below:

len(events) = 10000, len(measures) = 1525229

  • pandas .loc : 10.5 seconds
for _ in range(10):
    left_bound = dilution_copy.index - pd.Timedelta("4min")
    right_bound = dilution_copy.index + pd.Timedelta("1min")

    for left, right in zip(left_bound, right_bound):
        measure_copy.loc[left:right] = np.nan
  • Staircase : 13.9 seconds
for _ in range(10):
    sf = sc.Stairs(start=measure_copy.index, end=measure_copy.index[1:], value=measure_copy.values)
    mask = sc.Stairs(start=dilution_copy.index - pd.Timedelta('4 min'), end=dilution_copy.index + pd.Timedelta('1 min'))
    masked = sf.mask(mask)
    result = masked.sample(measure_copy.index, include_index=True)
  • Bisect .iloc: 35.1 seconds
for _ in range(10):
    left_bound = dilution_copy.index - pd.Timedelta("4min")
    right_bound = dilution_copy.index + pd.Timedelta("1min")

    for left, right in zip(left_bound, right_bound):
        measure_copy.iloc[bisect(measure_copy.index, left):bisect(measure_copy.index, right)] = np.nan

CodePudding user response:

If your measures data is already sorted (or if sorting it once is not too costly), you can think of using bisect.

Here is an approximate, more complete solution:

  • Check every element of events for where it can be "inserted" in the measures
  • Check if the timestamps on either side of this "insertion point" are within 5 seconds
  • If yes, set to nan
def bisect_loop():
    for event in events:
        bisect_point = bisect.bisect(measures.index, event)
        # Walk left from the insertion point while timestamps fall within 5 seconds.
        keep_looking_lower = True
        while keep_looking_lower:
            lower_side_index = max(0, bisect_point - 1)
            lower_side_diff = event - measures.index[lower_side_index]
            if lower_side_diff.seconds < 5 and bisect_point > 0:
                measures.loc[measures.index[lower_side_index]] = np.nan
                bisect_point -= 1
            else:
                keep_looking_lower = False
        # Walk right from the insertion point while timestamps fall within 5 seconds.
        keep_looking_higher = True
        while keep_looking_higher:
            higher_side_index = min(len(measures.index) - 1, bisect_point)
            higher_side_diff = measures.index[higher_side_index] - event
            if higher_side_diff.seconds < 5 and bisect_point < len(measures.index):
                measures.loc[measures.index[higher_side_index]] = np.nan
                bisect_point += 1
            else:
                keep_looking_higher = False

Here are some stats on a dummy dataset with 150 measures and 10 events -

df = pd.DataFrame({'year': [2000]*150, 'month': [2]*150, 'day': [12]*150, 'hour': np.random.choice(range(1), 150), 'minute': np.random.choice(range(60), 150), 'second': np.random.choice(range(60), 150)})

timestamps = pd.to_datetime(df)
measures = pd.concat([timestamps, pd.Series(np.random.rand(150))], axis=1)
measures = measures.set_index(0)
measures = measures.sort_index()

df = pd.DataFrame({'year': [2000]*150, 'month': [2]*150, 'day': [12]*150, 'hour': np.random.choice(range(24), 150), 'minute': np.random.choice(range(60), 150), 'second': np.random.choice(range(60), 150)})
events = pd.to_datetime(df).sample(10).reset_index(drop=True)

%timeit op_loop() # This is your loc based approach that is working
8.74 ms ± 126 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


%timeit bisect_loop()
3.22 ms ± 45.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

CodePudding user response:

Can be solved with staircase, a package built on pandas and numpy for working with (mathematical) step functions.

Setup:

measures = pd.Series(
    [0.471331, 0.069177, 0.955357, 0.107815, 0.046558, 0.056558, 0.837405],
    index = pd.DatetimeIndex([
        pd.Timestamp("1970-1-1 00:00:15"),
        pd.Timestamp("1970-1-1 00:02:37"),
        pd.Timestamp("1970-1-1 00:03:59"),
        pd.Timestamp("1970-1-1 00:06:17"),
        pd.Timestamp("1970-1-1 00:06:24"),
        pd.Timestamp("1970-1-1 00:06:25"),
        pd.Timestamp("1970-1-1 00:08:12"),
    ])
)

events = pd.DatetimeIndex(['1970-01-01 00:06:21'])

Solution:

import pandas as pd
import staircase as sc

sf = sc.Stairs(start=measures.index, end=measures.index[1:], value=measures.values)
mask = sc.Stairs(start=events - pd.Timedelta('5 seconds'), end=events + pd.Timedelta('5 seconds'))
masked = sf.mask(mask)
result = masked.sample(measures.index, include_index=True)

Why it works

1st line: create a step function which is made up of intervals whose endpoints are the index of measures. The last interval, which starts at 1970-01-01 00:08:12, has no end point and will be of infinite length

2nd line: create a step function where the times in your events variable are the centers of intervals in the step function, and the endpoints are +/- 5 seconds from the center. No problem at all if any intervals overlap.

3rd line: mask the first step function with the second step function, which sets values in the first step function to NaN whenever the second step function is non-zero

4th line: evaluates the masked step function at your event times
