Home > database >  Most efficient way to do rolling window on a datetimeindex with a data offset from the index
Most efficient way to do rolling window on a datetimeindex with a data offset from the index

Time:09-14

I am trying to calculate statistics over a shifted/offset rolling window of an inconsistent datetimeindex of a dataset in a pandas dataframe. I want to bring these statistics back to the current datetimeindex. I have a solution but it is computationally inefficient and impractical to run over my large dataset of millions of rows.

Here is a sample of what I want and my method to achieve it.

df = pd.DataFrame({'Col1': [10, 15, 20, 25, 30, 35, 40, 45, 50, 55]},
                   index=pd.DatetimeIndex(['2022-05-25T00:20:00.930','2022-05-25T00:20:01.257','2022-05-25T00:20:01.673','2022-05-25T00:20:03.125','2022-05-25T00:20:04.190',
                                           '2022-05-25T00:20:04.555','2022-05-25T00:20:04.923','2022-05-25T00:20:05.773','2022-05-25T00:20:05.989','2022-05-25T00:20:06.224'],yearfirst=True))

df:   
         Index             Col1
    2022-05-25 00:20:00.930    10
    2022-05-25 00:20:01.257    15
    2022-05-25 00:20:01.673    20
    2022-05-25 00:20:03.125    25
    2022-05-25 00:20:04.190    30
    2022-05-25 00:20:04.555    35
    2022-05-25 00:20:04.923    40
    2022-05-25 00:20:05.773    45
    2022-05-25 00:20:05.989    50
    2022-05-25 00:20:06.224    55

With the above dataset, this is my method to get a shifted rolling window at each index.

df['Col1 Avg'] = 0.0

for row in df.index:
   
    offset_t = datetime.timedelta(seconds=1.5)
    window_t = datetime.timedelta(seconds=1)
    beg = row-offset_t-window_t
    end = row-offset_t window_t
    
    df['Col1 Avg'].loc[row:row] = df['Col1'].loc[beg:end].mean()

df:
              Index           Col1  Col1 Avg
    2022-05-25 00:20:00.930    10      NaN
    2022-05-25 00:20:01.257    15      NaN
    2022-05-25 00:20:01.673    20     10.0
    2022-05-25 00:20:03.125    25     15.0
    2022-05-25 00:20:04.190    30     25.0
    2022-05-25 00:20:04.555    35     25.0
    2022-05-25 00:20:04.923    40     27.5
    2022-05-25 00:20:05.773    45     35.0
    2022-05-25 00:20:05.989    50     35.0
    2022-05-25 00:20:06.224    55     35.0

Is there a way to do this more efficiently? This takes ~5 minutes for just 10,0000 rows whereas a standard rolling window is <0.05 seconds.

Something like this seems like it should work but doesn't (I think) because of the inconsistent datetimeindex entries.

df['shifted avg'] = df['Col1'].shift(-1,freq=offset_t).rolling('2s').mean()

df:

          Index            Col1 Col1 Avg    shifted avg
2022-05-25 00:20:00.930    10     NaN            NaN
2022-05-25 00:20:01.257    15     NaN            NaN
2022-05-25 00:20:01.673    20    10.0            NaN
2022-05-25 00:20:03.125    25    15.0            NaN
2022-05-25 00:20:04.190    30    25.0            NaN
2022-05-25 00:20:04.555    35    25.0            NaN
2022-05-25 00:20:04.923    40    27.5            NaN
2022-05-25 00:20:05.773    45    35.0            NaN
2022-05-25 00:20:05.989    50    35.0            NaN
2022-05-25 00:20:06.224    55    35.0            NaN

CodePudding user response:

If you resample to 1ms, you can then take a 2 second rolling and offset by 500ms. Then since you have a record for every ms, and that's the resolution of your original index, you can merge them together to get the correct answers.

import pandas as pd

df = pd.DataFrame({'Col1': [10, 15, 20, 25, 30, 35, 40, 45, 50, 55]},
                   index=pd.DatetimeIndex(['2022-05-25T00:20:00.930','2022-05-25T00:20:01.257','2022-05-25T00:20:01.673','2022-05-25T00:20:03.125','2022-05-25T00:20:04.190',
                                           '2022-05-25T00:20:04.555','2022-05-25T00:20:04.923','2022-05-25T00:20:05.773','2022-05-25T00:20:05.989','2022-05-25T00:20:06.224'],yearfirst=True))


df = df.merge(df.resample('1ms')
                .min()
                .rolling('2S')
                .mean()
                .shift(500)
                .rename(columns={'Col1':'Col1 Avg'}),
              left_index=True, 
              right_index=True)

print(df)

Output

                        Col1  Col1 Avg
2022-05-25 00:20:00.930    10       NaN
2022-05-25 00:20:01.257    15       NaN
2022-05-25 00:20:01.673    20      10.0
2022-05-25 00:20:03.125    25      15.0
2022-05-25 00:20:04.190    30      25.0
2022-05-25 00:20:04.555    35      25.0
2022-05-25 00:20:04.923    40      27.5
2022-05-25 00:20:05.773    45      35.0
2022-05-25 00:20:05.989    50      35.0
2022-05-25 00:20:06.224    55      35.0
  • Related