Is there a very fast (vectorized) way to calculate the equivalent of rolling(window).max() but the w-CodePudding

I wanted to practically calculate a basic dataframe.column.rolling(window).max() but the window is another column of arbitrary integers derived at an earlier stage.

However: methods similar to those found here or here or here appear to be extremely slow to the point of being unusable when the dataframe is large.

I suspect it's because SIMD hardware may prefer a constant nature of window sizes but I wonder if there is a way I miss.

Example data (as found in the first method linked above):

import pandas as pd
import numpy as np

np.random.seed([3,14])
a = np.random.randn(20).cumsum()
w = np.minimum(
    np.random.randint(1, 4, size=a.shape),
    np.arange(len(a)) 1
)

df = pd.DataFrame({'Data': a, 'Window': w})
df 
        Data  Window
0  -0.602923       1
1  -1.005579       2
2  -0.703250       3
3  -1.227599       1
4  -0.683756       1
5  -0.670621       2
6  -0.997120       1
7   0.387956       3
8   0.255502       1
9  -0.152361       2
10  1.150534       3
11  0.546298       3
12  0.302936       3
13  0.091674       1
14 -1.964947       1
15 -1.447079       2
16 -1.487828       1
17 -2.539703       1
18 -1.932612       3
19 -4.163049       2

Expected result of an equivalent rolling window max:

0  -0.602923 
1  -0.602923
2  -0.602923 
3  -1.227599 
4  -0.683756 
5  -0.670621 
6  -0.997120 
7   0.387956 
8   0.255502 
9   0.255502  
10  1.150534  
11  1.150534 
12  1.150534
13  0.091674 
14 -1.964947 
15 -1.447079 
16 -1.487828
17 -2.539703 
18 -1.487828 
19 -1.932612

CodePudding user response：

The fastest solution for pandas 1.3.5 I could find is a custom indexer

import pandas as pd
from pandas import api
import numpy as np

class MultiWindowIndexer(api.indexers.BaseIndexer):
    def __init__(self, window):
        self.window = np.array(window)
        super().__init__()

    def get_window_bounds(self, num_values, min_periods, center, closed):
        end = np.arange(num_values, dtype='int64')   1
        start = np.clip(end - self.window, 0, num_values)
        return start, end

Using the cython implementation of max (which is the default) with 2*10**6 elements, windows of maximal length of 10

np.random.seed([3,14])
a = np.random.randn(2*10**6).cumsum()
w = np.minimum(
    np.random.randint(1, 10, size=a.shape),
    np.arange(len(a)) 1
)

df = pd.DataFrame({'Data': a, 'Window': w})

df['max1'] = df.Data.rolling(MultiWindowIndexer(df.Window)).max(engine='cython')
# %timeit 10 loops, best of 5: 116 ms per loop

The numba engine is ~2.4x slower

df['max2'] = df.Data.rolling(MultiWindowIndexer(df.Window)).max(engine='numba')
# %timeit 1 loop, best of 5: 278 ms per loop

[Update: presumably not fixed in pandas 1.4.x]
Oddly enough, the results are different. Unfortunately the cython result is not correct as it uses wrong starting indices for the maximum.

np.testing.assert_allclose(df.max2, df.max1)

Output

Not equal to tolerance rtol=1e-07, atol=0

Mismatched elements: 448192 / 2000000 (22.4%)

Analysis of the bug

The cython implementation seems to remember the largest starting index encountered so far and 'clips' smaller starting indices to the stored value.
More technically correct: only stores the range of the largest start and largest end indices encountered so far in a queue, discarding smaller start indices and making them unavailable.

CodePudding user response：

Here's a NumPy-only vectorized method. It isn't as fast as @MichaelSzczesny's custom indexer with the Cython engine, but it might be useful if it turns out that the bug in the Pandas code hasn't been fixed yet.

def windowed_max(a, w):
    k = np.arange(1, len(w) 1)
    windows = np.column_stack((k - w, k))
    return np.maximum.reduceat(a, windows.ravel()[:-1])[::2]

For example,

In [350]: np.random.seed([3,14])
     ...: a = np.random.randn(20).cumsum()
     ...: w = np.minimum(
     ...:     np.random.randint(1, 4, size=a.shape),
     ...:     np.arange(len(a)) 1
     ...: )

In [351]: windowed_max(a, w)
Out[351]: 
array([-0.60292337, -0.60292337, -0.60292337, -1.227599  , -0.68375631,
       -0.67062118, -0.99711965,  0.38795643,  0.25550199,  0.25550199,
        1.15053366,  1.15053366,  1.15053366,  0.09167441, -1.96494674,
       -1.44707895, -1.48782801, -2.53970325, -1.48782801, -1.93261164])

In [352]: windowed_max([2, 0, 1], [1, 1, 3])
Out[352]: array([2, 0, 2])

To use it with your Pandas Dataframe, you could write:

df['wmax'] =  windowed_max(df.Data.to_numpy(), df.Window.to_numpy())