How to calculate the total number of 1-hour intervals in a sequence of intervals?-CodePudding

Let's consider the following dataframe of sorted time intervals:

import pandas as pd
from io import StringIO

s="""start_time,end_time
2022-01-01 12:30:00,2022-01-01 12:45:00
2022-01-01 13:05:00,2022-01-01 13:50:00
2022-01-01 14:00:00,2022-01-01 14:20:00
2022-01-01 16:00:00,2022-01-01 16:45:00
2022-01-01 17:20:00,2022-01-01 17:35:00
2022-01-01 17:45:00,2022-01-01 18:30:00
2022-01-01 19:00:00,2022-01-01 19:25:00"""
df = pd.read_csv(StringIO(s), sep=",")
df.start_time = pd.to_datetime(df.start_time)
df.end_time = pd.to_datetime(df.end_time)

          start_time               end_time
0    2022-01-01 12:30:00    2022-01-01 12:45:00
1    2022-01-01 13:05:00    2022-01-01 13:50:00
2    2022-01-01 14:00:00    2022-01-01 14:20:00
3    2022-01-01 16:00:00    2022-01-01 16:45:00
4    2022-01-01 17:20:00    2022-01-01 17:35:00
5    2022-01-01 17:45:00    2022-01-01 18:30:00
6    2022-01-01 19:00:00    2022-01-01 19:25:00

The idea is that an 1-hour interval is basically calculated in the following way: we start with the start_time of the first interval and we add 1 hour to it.

If the resulting timestamp is within one of the following intervals that are in the dataframe, then we repeat the process by adding 1-hour to this new timestamp and so on.

If, however, the resulting timestamp is not within, but between two intervals, then we continue by adding 1-hour to the start_time of the next interval.

The input would be the dataframe above.

The process is: We start by adding 1-hour to the start_time of the first interval:

12:30 1H -> 13:30 (13:30 is a timestamp that is within one of the available intervals. In particular, it is within 13:05 - 13:50, which is an interval in our dataframe. We shall, then, continue from 13:30).
13:30 1H -> 14:30 (14:30 is not contained in any of our df intervals - we pick the closest start_time after 14:30)
16:00 1H -> 17:00 (17:00 not included in any interval of our dataframe)
17:20 1H -> 18:20 (18:20 is included between 17:45 - 18:30, which is also an interval that we have in our dataframe)
18:20 1H -> 19:20 (it is included in our last interval)
19:20 1H -> 20:20 (we have surpassed the end_time of our last inteval, so we stop)

Output: 6

The output stands for the total number of times that the process of adding 1H was repeated.

As far as code is concerned I have thought of maybe using .shift() somehow but I am not sure how. The problem is that when the resulting timestamp is not between an available interval, then we should search for the closest following start_time.

CodePudding user response：

Not sure if pandas is really necessary here, but here is a solution following your logic.

from datetime import timedelta
import numpy as np
count = 0
start = df.loc[0,'start_time']
while 1:
    count  = 1
    print("hour interval start:", start)
    end_of_interv = start   timedelta(hours=1)
    new_row = np.searchsorted(df.end_time, end_of_interv)
    if new_row >= len(df):
        break
    s, e = df.loc[new_row, ['start_time', 'end_time']]
    if end_of_interv < s:
        start = s
    elif s < end_of_interv < e:
        start = end_of_interv
print("Number of intervals counted: %d" % count)
#hour interval start: 2022-01-01 12:30:00
#hour interval start: 2022-01-01 13:30:00
#hour interval start: 2022-01-01 16:00:00
#hour interval start: 2022-01-01 17:20:00
#hour interval start: 2022-01-01 18:20:00
#hour interval start: 2022-01-01 19:20:00
#Number of intervals counted: 6

You should test this on a few more examples with different intervals (e.g. some longer than 1 hour) and start times, and verify it produces the answers you seek.

CodePudding user response：

It is unlikely that vectorization (i.e. parallelization) is possible, because the process at each step depends on the result of the calculations at the previous steps. The solution in any case will be some kind of iteration. And the speed of work will depend primarily on the algorithm you choose to work with.

It seems to me that a good algorithm would be to see if the end_time and start_time of neighboring records fall into the same hour step as if we were measuring length by hours starting from some point. For this we can use integer division:

import pandas as pd
from io import StringIO

s = """start_time,end_time
2022-01-01 12:30:00,2022-01-01 12:45:00
2022-01-01 13:05:00,2022-01-01 13:50:00
2022-01-01 14:00:00,2022-01-01 14:20:00
2022-01-01 16:00:00,2022-01-01 16:45:00
2022-01-01 17:20:00,2022-01-01 17:35:00
2022-01-01 17:45:00,2022-01-01 18:30:00
2022-01-01 19:00:00,2022-01-01 19:25:00"""

df = pd.read_csv(StringIO(s), parse_dates=[0, 1])

data = df.to_numpy().flatten()
start = data[0]
step = pd.Timedelta(1, 'H')   # hour as a unit of length
count = 0
for x, y in data[1:-1].reshape(-1, 2):
    # x is previous end_time
    # y is next start_time
    length = (x-start) // step   1
    if start   step*length < y:
        count  = length
        start = y
count  = (data[-1]-start) // step   1

print(f'{count = }')