I wanted to practically calculate a basic dataframe.column.rolling(window).max()
but the window is another column of arbitrary integers derived at an earlier stage.
However: methods similar to those found here or here or here appear to be extremely slow to the point of being unusable when the dataframe is large.
I suspect it's because SIMD hardware may prefer a constant nature of window sizes but I wonder if there is a way I miss.
Example data (as found in the first method linked above):
import pandas as pd
import numpy as np
np.random.seed([3,14])
a = np.random.randn(20).cumsum()
w = np.minimum(
np.random.randint(1, 4, size=a.shape),
np.arange(len(a)) 1
)
df = pd.DataFrame({'Data': a, 'Window': w})
df
Data Window
0 -0.602923 1
1 -1.005579 2
2 -0.703250 3
3 -1.227599 1
4 -0.683756 1
5 -0.670621 2
6 -0.997120 1
7 0.387956 3
8 0.255502 1
9 -0.152361 2
10 1.150534 3
11 0.546298 3
12 0.302936 3
13 0.091674 1
14 -1.964947 1
15 -1.447079 2
16 -1.487828 1
17 -2.539703 1
18 -1.932612 3
19 -4.163049 2
Expected result of an equivalent rolling window max:
0 -0.602923
1 -0.602923
2 -0.602923
3 -1.227599
4 -0.683756
5 -0.670621
6 -0.997120
7 0.387956
8 0.255502
9 0.255502
10 1.150534
11 1.150534
12 1.150534
13 0.091674
14 -1.964947
15 -1.447079
16 -1.487828
17 -2.539703
18 -1.487828
19 -1.932612
CodePudding user response:
The fastest solution for pandas 1.3.5
I could find is a custom indexer
import pandas as pd
from pandas import api
import numpy as np
class MultiWindowIndexer(api.indexers.BaseIndexer):
def __init__(self, window):
self.window = np.array(window)
super().__init__()
def get_window_bounds(self, num_values, min_periods, center, closed):
end = np.arange(num_values, dtype='int64') 1
start = np.clip(end - self.window, 0, num_values)
return start, end
Using the cython
implementation of max
(which is the default) with 2*10**6 elements, windows of maximal length of 10
np.random.seed([3,14])
a = np.random.randn(2*10**6).cumsum()
w = np.minimum(
np.random.randint(1, 10, size=a.shape),
np.arange(len(a)) 1
)
df = pd.DataFrame({'Data': a, 'Window': w})
df['max1'] = df.Data.rolling(MultiWindowIndexer(df.Window)).max(engine='cython')
# %timeit 10 loops, best of 5: 116 ms per loop
The numba
engine is ~2.4x slower
df['max2'] = df.Data.rolling(MultiWindowIndexer(df.Window)).max(engine='numba')
# %timeit 1 loop, best of 5: 278 ms per loop
[Update: presumably not fixed in pandas 1.4.x
]
Oddly enough, the results are different. Unfortunately the cython
result is not correct as it uses wrong starting indices for the maximum.
np.testing.assert_allclose(df.max2, df.max1)
Output
Not equal to tolerance rtol=1e-07, atol=0
Mismatched elements: 448192 / 2000000 (22.4%)
Analysis of the bug
The cython
implementation seems to remember the largest starting index encountered so far and 'clips' smaller starting indices to the stored value.
More technically correct: only stores the range of the largest start and largest end indices encountered so far in a queue
, discarding smaller start indices and making them unavailable.
CodePudding user response:
Here's a NumPy-only vectorized method. It isn't as fast as @MichaelSzczesny's custom indexer with the Cython engine, but it might be useful if it turns out that the bug in the Pandas code hasn't been fixed yet.
def windowed_max(a, w):
k = np.arange(1, len(w) 1)
windows = np.column_stack((k - w, k))
return np.maximum.reduceat(a, windows.ravel()[:-1])[::2]
For example,
In [350]: np.random.seed([3,14])
...: a = np.random.randn(20).cumsum()
...: w = np.minimum(
...: np.random.randint(1, 4, size=a.shape),
...: np.arange(len(a)) 1
...: )
In [351]: windowed_max(a, w)
Out[351]:
array([-0.60292337, -0.60292337, -0.60292337, -1.227599 , -0.68375631,
-0.67062118, -0.99711965, 0.38795643, 0.25550199, 0.25550199,
1.15053366, 1.15053366, 1.15053366, 0.09167441, -1.96494674,
-1.44707895, -1.48782801, -2.53970325, -1.48782801, -1.93261164])
In [352]: windowed_max([2, 0, 1], [1, 1, 3])
Out[352]: array([2, 0, 2])
To use it with your Pandas Dataframe, you could write:
df['wmax'] = windowed_max(df.Data.to_numpy(), df.Window.to_numpy())