Home > Software design >  Is there a way to have a rolling window, that varies, based on an arbitrary Series of integers of th
Is there a way to have a rolling window, that varies, based on an arbitrary Series of integers of th

Time:04-09

The need arose because I want to calculate something like:

df['result'] = np.where(df['data1'] == True, df["data2"].rolling(window).max(), 0)

However: window is practically df['window']; a Series of data in the same time series; it contains integers and they may look something like [2,2,2,3,3,2,2,4,2,2].

What's the most vectorized or other efficient way to do this?

CodePudding user response:

Setup

import pandas as pd
import numpy as np

np.random.seed([3,14])
a = np.random.randn(20).cumsum()
w = np.minimum(
    np.random.randint(1, 4, size=a.shape),
    np.arange(len(a)) 1
)

df = pd.DataFrame({'Data': a, 'Window': w})
df

        Data  Window
0  -0.602923       1
1  -1.005579       2
2  -0.703250       3
3  -1.227599       1
4  -0.683756       1
5  -0.670621       2
6  -0.997120       1
7   0.387956       3
8   0.255502       1
9  -0.152361       2
10  1.150534       3
11  0.546298       3
12  0.302936       3
13  0.091674       1
14 -1.964947       1
15 -1.447079       2
16 -1.487828       1
17 -2.539703       1
18 -1.932612       3
19 -4.163049       2

Vectorized variable window rolling maximum

idx_base = np.arange(len(df))
windows = df.Window.to_numpy()
data = df.Data.to_numpy()

idx_max = np.concatenate([
    np.arange(position-(window_size-1), position 1)
    for window_size, position in zip(windows, idx_base)
])

idx_pos = np.repeat(idx_base, windows)
data_windowed = data[idx_max]

out = np.full(data.shape, data.min())
np.maximum.at(out, idx_pos, data_windowed)

The answer is now in the out variable. Let's look

df.Data.iloc[idx_max].groupby(idx_pos).max().to_frame('Pandas').assign(Numpy=out)

      Pandas     Numpy
0  -0.602923 -0.602923
1  -0.602923 -0.602923
2  -0.602923 -0.602923
3  -1.227599 -1.227599
4  -0.683756 -0.683756
5  -0.670621 -0.670621
6  -0.997120 -0.997120
7   0.387956  0.387956
8   0.255502  0.255502
9   0.255502  0.255502
10  1.150534  1.150534
11  1.150534  1.150534
12  1.150534  1.150534
13  0.091674  0.091674
14 -1.964947 -1.964947
15 -1.447079 -1.447079
16 -1.487828 -1.487828
17 -2.539703 -2.539703
18 -1.487828 -1.487828
19 -1.932612 -1.932612

Appendix

Look at idx_max. You'll see how we end up vectorizing this.

print(idx_max)

[           #  Position   Window
  0         #         0        1
  0  1      #         1        2
  0  1  2   #         2        3
  3         #         3        1
  4         #         4        1
  4  5      #         5        2
  6         #         6        1
  5  6  7   #         7        3
  8         #         8        1
  8  9      #         9        2
  8  9 10   #        10        3
  9 10 11   #        11        3
 10 11 12   #        12        3
 13         #        13        1
 14         #        14        1
 14 15      #        15        2
 16         #        16        1
 17         #        17        1
 16 17 18   #        18        3
 18 19      #        19        2
]
  • Related